# Feature Construction

Construct analysis-ready covariates/outcomes from preprocessed Crunchbase exports and Companies House enrichment, then create round-specific subsets used in downstream modeling.

**Key objects created in this notebook**
- `df_global`, `df_usa`, `df_uk`, `df_uk_cb_only`: base company-level frames.
- `datasets`: convenience dict to apply transformations across base frames.
- `apples`: company-level frames filtered for comparability and data cleanliness, so that comparisons are "apples to apples".
- `dfs`: round-level subsets (e.g., `df_usa_seed`, `df_global_pre_seed`) with engineered features.

**Notes**
- This notebook keeps a `freeze_date` to avoid leaking post-window events into the analysis.
- Many cells re-import common libraries so they can be rerun independently in a notebook workflow.


In [1]:

# Purpose: Import core libraries used throughout feature construction.
# NOTE: Many later cells repeat imports so they can be rerun independently.

from pathlib import Path
import pandas as pd
import numpy as np


## Study Parameters

Central configuration for feature construction.

- `freeze_date`: caps event dates (`closed_on`, `went_public_on`, `acquired_on_first`, `first_funding_date`) so anything after this date is treated as not-yet-observed.
- `founding_cohort_bins` / `founding_cohort_labels`: bin founding years into ordered cohorts used for stratification and percentile features.
- `time_thresholds_months`: month horizons (12, 36, 84) used to define outcome thresholds.
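
As a minimal sketch of what capping at `freeze_date` means in practice (the toy values below are made up; only the column name and the date come from this notebook), dates after the freeze date are replaced with `NaT` so they count as not yet observed:

```python
import pandas as pd

# Toy data for illustration only; values are hypothetical.
freeze_date = pd.Timestamp("2024-12-31")
events = pd.DataFrame({"closed_on": ["2023-06-30", "2025-03-15", None]})

# Parse to datetime; unparseable or missing values become NaT.
dates = pd.to_datetime(events["closed_on"], errors="coerce")

# Anything strictly after the freeze date is treated as not-yet-observed.
events["closed_on"] = dates.mask(dates > freeze_date)
print(events)
```

Only the second row is blanked by the cap; the third row was already missing.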


In [2]:

# Purpose: Central configuration for freeze date, cohort binning, and outcome thresholds.
# NOTE: These parameters are referenced by multiple downstream cells.

STUDY_PARAMS = {
    "preprocessing": {
        "freeze_date": pd.Timestamp("2024-12-31"),
    },
    "covariates": {
        # Active bins cover founding years 2007-2017; the extended 2018-2024 variant is kept commented out below.
        "founding_cohort_bins": [2007, 2009, 2013, 2017],
        # "founding_cohort_bins": [2007, 2009, 2013, 2017, 2020, 2024],
        
        "founding_cohort_labels": [
            "2007-2009", 
            "2010-2013", 
            "2014-2017"
        ],
        #  "founding_cohort_labels": [
        #     "2007-2009", 
        #     "2010-2013", 
        #     "2014-2017", 
        #     "2018-2020",
        #     "2021-2024"
        # ],
    },
    "outcomes": {
        "time_thresholds_months": [12, 36, 84],
    },
}


## Load Base Data

Read the preprocessed Crunchbase exports plus Companies House enrichment for the UK.

Outputs:
- `df_global`: global Crunchbase companies.
- `df_usa`: USA Crunchbase companies.
- `df_uk`: UK Companies House–enriched frame.
- `df_uk_cb_only`: UK Crunchbase-only frame (kept for overlays/diagnostics).


In [3]:

# Purpose: Load preprocessed company-level inputs and keep UK CB-only vs CH-enriched frames separate.
# NOTE: The existence check fails fast if any required input is missing.

DATA_ROOT = Path("/Users/stefan/Desktop/Thesis/v4")
STUDY_DIR = DATA_ROOT / "Study"
CB_PREPROC_DIR = STUDY_DIR / "cb data pre-processing"

paths = {
    "global": CB_PREPROC_DIR / "global_companies.csv",
    "uk_cb_only": CB_PREPROC_DIR / "uk_companies.csv",
    "uk_ch": DATA_ROOT / "Companies House Data" / "uk_funded_ch.csv",
    "usa": CB_PREPROC_DIR / "usa_companies.csv",
}

for label, path in paths.items():
    if not path.exists():
        raise FileNotFoundError(f"Missing required input '{label}': {path}")

# Base dataframes
df_global = pd.read_csv(paths["global"], low_memory=False)
df_uk = pd.read_csv(paths["uk_ch"], low_memory=False)  # Companies House enriched
df_usa = pd.read_csv(paths["usa"], low_memory=False)

# UK CB-only frame kept separately for overlays and diagnostics
df_uk_cb_only = pd.read_csv(paths["uk_cb_only"], low_memory=False)


In [4]:
# Purpose: Collect base frames into a single dict so we can loop over datasets consistently.

datasets = {
    "df_global": df_global,
    "df_uk": df_uk,
    "df_usa": df_usa,
    "df_uk_cb_only": df_uk_cb_only,
}

## Fixing status=closed vs status=acquired

Crunchbase records can contain inconsistencies where a company is marked `closed` but also has an acquisition date.

- First cell: diagnose the rows where `status == closed` AND `acquired_on_first` is present.
- Second cell: correct those rows by reclassifying them as `acquired`.
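
The diagnose-then-correct step can be sketched on a toy frame as follows. The rows are hypothetical; the column names and the case-insensitive match follow the cells in this section.

```python
import pandas as pd

# Hypothetical rows for illustration only.
toy = pd.DataFrame(
    {
        "status": ["closed", "Closed", "operating", "closed"],
        "acquired_on_first": ["2015-04-08", "2012-03-22", None, None],
    }
)

# Inconsistent rows: labeled closed, yet an acquisition date is present.
mask = toy["status"].str.lower().eq("closed") & toy["acquired_on_first"].notna()
print(f"flagged: {mask.sum()}")

# Reclassify only the flagged rows; genuinely closed companies are untouched.
toy.loc[mask, "status"] = "acquired"
print(toy)
```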


In [5]:
# Purpose: Diagnose inconsistent lifecycle labels (status closed but an acquisition date exists).

import pandas as pd

flagged = {}
for name, df in datasets.items():
    mask = (df["status"].str.lower() == "closed") & df["acquired_on_first"].notna()
    flagged[name] = df.loc[mask]
    print(f"{name}: {mask.sum()} rows with closed status and acquired_on_first filled")

# Preview the first 20 flagged rows for each dataset.
for name, rows in flagged.items():
    if not rows.empty:
        print(f"\n{name} flagged rows:")
        print(rows[["status", "acquired_on_first"]].head(20))


df_global: 12511 rows with closed status and acquired_on_first filled
df_uk: 95 rows with closed status and acquired_on_first filled
df_usa: 7128 rows with closed status and acquired_on_first filled
df_uk_cb_only: 911 rows with closed status and acquired_on_first filled

df_global flagged rows:
     status acquired_on_first
6    closed        2008-12-01
45   closed        2016-03-03
49   closed        2010-01-07
52   closed        2014-02-24
57   closed        2008-05-01
58   closed        2013-12-04
59   closed        2015-10-14
62   closed        2010-08-17
77   closed        2011-04-15
81   closed        2016-04-01
82   closed        2014-09-22
86   closed        2010-10-19
90   closed        2010-09-28
100  closed        2015-04-08
118  closed        2012-06-04
123  closed        2008-12-15
128  closed        2012-03-22
130  closed        2012-09-05
131  closed        2020-04-01
134  closed        2012-03-15

df_uk flagged rows:
     status acquired_on_first
4    closed        2009

In [6]:
# Purpose: Reclassify 'closed' → 'acquired' when an acquisition date is present (within each dataset).

for name, df in datasets.items():
    # Build mask: status is closed (case-insensitive) AND acquired_on_first not null
    mask = df["status"].str.lower().eq("closed") & df["acquired_on_first"].notna()
    # Update status
    df.loc[mask, "status"] = "acquired"
    changed = mask.sum()
    if changed:
        print(f"{name}: updated {changed} rows to 'acquired'")



df_global: updated 12511 rows to 'acquired'
df_uk: updated 95 rows to 'acquired'
df_usa: updated 7128 rows to 'acquired'
df_uk_cb_only: updated 911 rows to 'acquired'


## UK: Overlay Companies House onto Crunchbase

Align the UK Crunchbase-only frame (`df_uk_cb_only`) and the CH-enriched frame (`df_uk`).

Logic:
- Join by `org_uuid`.
- Prefer CH-enriched values when present (fallback to Crunchbase otherwise).
- Apply `freeze_date` to event-date columns to prevent future leakage.
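
A small sketch of the "prefer CH, fall back to Crunchbase" rule, using `DataFrame.combine_first` on made-up rows (the `org_uuid` values and dates are hypothetical):

```python
import pandas as pd

# Hypothetical Crunchbase-only rows.
cb = pd.DataFrame(
    {
        "org_uuid": ["a", "b"],
        "founded_on": ["2010-01-01", "2012-05-01"],
        "status": ["operating", "closed"],
    }
).set_index("org_uuid")

# Hypothetical CH-enriched rows; "b" has no status on the CH side.
ch = pd.DataFrame(
    {
        "org_uuid": ["b", "c"],
        "founded_on": ["2012-04-15", "2015-01-01"],
        "status": [None, "operating"],
    }
).set_index("org_uuid")

# CH values win where present; missing CH cells are filled from Crunchbase.
merged = ch.combine_first(cb)
print(merged)
```

For `b`, `founded_on` comes from the CH side while `status` falls back to the Crunchbase value. Rows that exist in only one frame are kept as they are.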


In [7]:

# Purpose: Overlay CH-enriched UK data onto the CB-only UK frame and cap event dates at the freeze date.
# NOTE: `combine_first` prefers CH values and falls back to CB where CH is missing.

freeze_date = STUDY_PARAMS["preprocessing"]["freeze_date"]


def overlay_cb_ch(cb_only: pd.DataFrame, ch_enriched: pd.DataFrame, *, freeze_date: pd.Timestamp) -> tuple[pd.DataFrame, int]:
    required = ["org_uuid"]
    for col in required:
        if col not in cb_only.columns or col not in ch_enriched.columns:
            raise KeyError(f"Missing required column '{col}' in cb_only or ch_enriched")

    cb = cb_only.set_index("org_uuid")
    ch = ch_enriched.set_index("org_uuid")

    all_cols = cb.columns.union(ch.columns)
    cb = cb.reindex(columns=all_cols)
    ch = ch.reindex(columns=all_cols)

    replaced_count = len(cb.index.intersection(ch.index))
    merged = ch.combine_first(cb).reset_index()

    event_cols = ["closed_on", "went_public_on", "acquired_on_first", "first_funding_date"]
    for col in event_cols:
        if col in merged.columns:
            dates = pd.to_datetime(merged[col], errors="coerce")
            merged[col] = dates.mask(dates > freeze_date)

    return merged, replaced_count


df_uk, replaced_rows = overlay_cb_ch(df_uk_cb_only, df_uk, freeze_date=freeze_date)
print(f"Replaced rows for {replaced_rows} org_uuid values; final df_uk shape: {df_uk.shape}")


Replaced rows for 3635 org_uuid values; final df_uk shape: (83103, 118)


## Companies House Status Adjustments

Use Companies House creation/cessation information to improve UK lifecycle variables.

- Update `founded_on` using CH creation dates when they are close enough to the reported founding date.
- Use CH cessation dates to populate `closed_on` (and drop impossible/future cessations).
- Reclassify `status` to `closed` when CH shows dissolution before the freeze date.
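
The cessation rules can be illustrated with a toy frame. This is a sketch under the assumption that both date columns are already parsed; the three rows are made up to hit the usable, future, and impossible cases.

```python
import pandas as pd

freeze_date = pd.Timestamp("2024-12-31")

# Hypothetical rows: usable cessation, future cessation, cessation before founding.
toy = pd.DataFrame(
    {
        "founded_on": pd.to_datetime(["2010-01-01", "2015-06-01", "2018-03-01"]),
        "ch_date_of_cessation": pd.to_datetime(["2020-05-01", "2025-02-01", "2017-01-01"]),
    }
)

cess = toy["ch_date_of_cessation"]
valid = cess.notna() & (cess <= freeze_date)       # observed within the study window
future = cess.notna() & (cess > freeze_date)       # after the freeze date: not yet observed
impossible = valid & (cess < toy["founded_on"])    # dissolved before it was founded
usable = valid & ~impossible

# Only usable cessation dates become closed_on; the rest stay NaT.
toy["closed_on"] = cess.where(usable)
print(toy)
```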


In [8]:
# Purpose: Use Companies House creation/cessation signals to improve `founded_on`, `closed_on`, and `status`.
# NOTE: The freeze-date guard prevents post-window cessations from being treated as observed.

# 1. Convert relevant date columns to datetime objects
ch_creation = pd.to_datetime(df_uk.get("ch_date_of_creation"), errors="coerce")
ch_cessation = pd.to_datetime(df_uk.get("ch_date_of_cessation"), errors="coerce")
founded_on = pd.to_datetime(df_uk.get("founded_on"), errors="coerce")

# ---------------------------------------------------------
# NEW LOGIC: Update founded_on using ch_date_of_creation
# ---------------------------------------------------------

# Calculate the absolute difference between the reported founded date and CH creation date
date_diff = (ch_creation - founded_on).abs()

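# Acceptance threshold for replacing `founded_on` with the CH creation date.
# NOTE: despite the variable name, this is 20,000 days (~54.8 years), not 365 days,
# so nearly every row with both dates passes. Use pd.Timedelta(days=365) for a strict one-year window.
one_year = pd.Timedelta(days=20000)

# Mask: both dates exist AND the difference is within the threshold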
update_founded_mask = (
    ch_creation.notna() 
    & founded_on.notna() 
    & (date_diff <= one_year)
)

# Update the main DataFrame
df_uk.loc[update_founded_mask, "founded_on"] = ch_creation.loc[update_founded_mask]

count_founded_updated = update_founded_mask.sum()
print(f"Founded dates updated using CH creation date (within 1 year difference): {count_founded_updated:,}")

# ---------------------------------------------------------
# REFRESH founded_on VARIABLE
# ---------------------------------------------------------
# We must reload this series because we just modified the DataFrame, 
# and the subsequent cessation logic relies on the *updated* founded_on values.
founded_on = pd.to_datetime(df_uk.get("founded_on"), errors="coerce")


# ---------------------------------------------------------
# EXISTING LOGIC: Cessation and Status
# ---------------------------------------------------------

valid_cessation = ch_cessation.notna() & (ch_cessation <= freeze_date)
future_cessation = ch_cessation.notna() & (ch_cessation > freeze_date)

# This now uses the UPDATED founded_on variable
impossible_cessation = valid_cessation & founded_on.notna() & (ch_cessation < founded_on)

# Update closed_on where the CH date is usable
usable = valid_cessation & ~impossible_cessation
df_uk.loc[usable, "closed_on"] = ch_cessation

# Drop impossible or future cessation dates
invalid_mask = future_cessation | impossible_cessation
df_uk.loc[invalid_mask, "closed_on"] = pd.NaT

print(f"Cessation dates > freeze_date dropped: {future_cessation.sum():,}")
print(f"Cessation before founded_on dropped: {impossible_cessation.sum():,}")

ch_status = df_uk.get("ch_company_status", pd.Series(index=df_uk.index)).astype(str).str.lower()
status_lower = df_uk.get("status", pd.Series(index=df_uk.index)).astype(str).str.lower()
closed_statuses = {"dissolved"}

mask_reclass = (
    status_lower.eq("operating")
    & ch_status.isin(closed_statuses)
    & valid_cessation
    & ~impossible_cessation
)

updated_count = mask_reclass.sum()
df_uk.loc[mask_reclass, "status"] = "closed"
print(f"Reclassified to closed (operating vs CH closed status, cessation <= freeze): {updated_count:,}")

missing_cessation_mask = ch_status.isin(closed_statuses) & ch_cessation.isna()
print(f"Rows with CH closed status but no ch_date_of_cessation: {missing_cessation_mask.sum():,}")

Founded dates updated using CH creation date (within 1 year difference): 3,676
Cessation dates > freeze_date dropped: 145
Cessation before founded_on dropped: 0
Reclassified to closed (operating vs CH closed status, cessation <= freeze): 481
Rows with CH closed status but no ch_date_of_cessation: 0


## Checking Round-Date Logic

Sanity-check that funding stage dates occur in a sensible order (e.g., Seed should not systematically appear after Series A).

The helper computes pairwise ordering statistics across the stage date columns to identify inconsistencies and potential data-quality issues.
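
For a single pair of stages, the ordering statistics reduce to three shares over the rows where both dates are present. A toy sketch with made-up dates:

```python
import pandas as pd

# Hypothetical stage dates; the last row lacks a seed date and is dropped.
toy = pd.DataFrame(
    {
        "date_seed": pd.to_datetime(["2015-01-01", "2016-03-01", "2017-05-01", None]),
        "date_series_a": pd.to_datetime(["2016-01-01", "2016-03-01", "2017-01-01", "2018-01-01"]),
    }
)

pair = toy[["date_seed", "date_series_a"]].dropna()
n = len(pair)

earlier_first = (pair["date_seed"] < pair["date_series_a"]).sum() / n  # expected order
later_first = (pair["date_series_a"] < pair["date_seed"]).sum() / n    # inverted order
tie = 1 - earlier_first - later_first                                  # same date

print(n, earlier_first, later_first, tie)
```

A non-trivial `later_first` share would point to mislabeled rounds or date errors in the source data.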


In [9]:
# Purpose: Quantify whether funding-stage date columns follow the expected chronology (pairwise ordering stats).

import pandas as pd

stages = [
    "date_pre_seed",
    "date_seed",
    "date_series_a",
    "date_series_b",
    "date_series_c",
]

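# Rebuild the dict: `df_uk` was rebound by the overlay in In [7],
# so the entry created in In [4] still points at the pre-overlay frame.
datasets = {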
    "df_global": df_global,
    "df_uk": df_uk,
    "df_usa": df_usa,
    "df_uk_cb_only": df_uk_cb_only,
}

def pairwise_ordering(datasets, stages):
    out = {}
    for name, df in datasets.items():
        rows = []
        for i, earlier in enumerate(stages):
            for later in stages[i + 1:]:
                pair = df[[earlier, later]].dropna()
                if pair.empty:
                    continue
                total = len(pair)
                earlier_first = (pair[earlier] < pair[later]).sum()
                later_first = (pair[later] < pair[earlier]).sum()
                ties = total - earlier_first - later_first
                rows.append(
                    {
                        "earlier": earlier,
                        "later": later,
                        "n_pairs": total,
                        "earlier_first_pct": earlier_first / total,
                        "later_first_pct": later_first / total,
                        "tie_pct": ties / total,
                    }
                )
        out[name] = pd.DataFrame(rows)
    return out

ordering = pairwise_ordering(datasets, stages)
for name, df in ordering.items():
    print(f"\n{name}")
    display(df.sort_values(["earlier", "later"]))



df_global


| | earlier | later | n_pairs | earlier_first_pct | later_first_pct | tie_pct |
|---|---|---|---|---|---|---|
| 0 | date_pre_seed | date_seed | 6130 | 0.998858 | 0.0 | 0.001142 |
| 1 | date_pre_seed | date_series_a | 2146 | 1.0 | 0.0 | 0.0 |
| 2 | date_pre_seed | date_series_b | 876 | 1.0 | 0.0 | 0.0 |
| 3 | date_pre_seed | date_series_c | 338 | 1.0 | 0.0 | 0.0 |
| 4 | date_seed | date_series_a | 14782 | 0.999053 | 0.0 | 0.000947 |
| 5 | date_seed | date_series_b | 6873 | 1.0 | 0.0 | 0.0 |
| 6 | date_seed | date_series_c | 2944 | 1.0 | 0.0 | 0.0 |
| 7 | date_series_a | date_series_b | 13378 | 0.999626 | 0.0 | 0.000374 |
| 8 | date_series_a | date_series_c | 5841 | 1.0 | 0.0 | 0.0 |
| 9 | date_series_b | date_series_c | 6304 | 0.999841 | 0.0 | 0.000159 |



df_uk


| | earlier | later | n_pairs | earlier_first_pct | later_first_pct | tie_pct |
|---|---|---|---|---|---|---|
| 0 | date_pre_seed | date_seed | 500 | 0.998 | 0.0 | 0.002 |
| 1 | date_pre_seed | date_series_a | 153 | 1.0 | 0.0 | 0.0 |
| 2 | date_pre_seed | date_series_b | 51 | 1.0 | 0.0 | 0.0 |
| 3 | date_pre_seed | date_series_c | 16 | 1.0 | 0.0 | 0.0 |
| 4 | date_seed | date_series_a | 975 | 0.998974 | 0.0 | 0.001026 |
| 5 | date_seed | date_series_b | 385 | 1.0 | 0.0 | 0.0 |
| 6 | date_seed | date_series_c | 124 | 1.0 | 0.0 | 0.0 |
| 7 | date_series_a | date_series_b | 567 | 1.0 | 0.0 | 0.0 |
| 8 | date_series_a | date_series_c | 209 | 1.0 | 0.0 | 0.0 |
| 9 | date_series_b | date_series_c | 225 | 1.0 | 0.0 | 0.0 |



df_usa


| | earlier | later | n_pairs | earlier_first_pct | later_first_pct | tie_pct |
|---|---|---|---|---|---|---|
| 0 | date_pre_seed | date_seed | 2480 | 0.998387 | 0.0 | 0.001613 |
| 1 | date_pre_seed | date_series_a | 968 | 1.0 | 0.0 | 0.0 |
| 2 | date_pre_seed | date_series_b | 442 | 1.0 | 0.0 | 0.0 |
| 3 | date_pre_seed | date_series_c | 176 | 1.0 | 0.0 | 0.0 |
| 4 | date_seed | date_series_a | 7097 | 0.998873 | 0.0 | 0.001127 |
| 5 | date_seed | date_series_b | 3526 | 1.0 | 0.0 | 0.0 |
| 6 | date_seed | date_series_c | 1617 | 1.0 | 0.0 | 0.0 |
| 7 | date_series_a | date_series_b | 6152 | 0.999837 | 0.0 | 0.000163 |
| 8 | date_series_a | date_series_c | 2959 | 1.0 | 0.0 | 0.0 |
| 9 | date_series_b | date_series_c | 3151 | 0.999683 | 0.0 | 0.000317 |



df_uk_cb_only


| | earlier | later | n_pairs | earlier_first_pct | later_first_pct | tie_pct |
|---|---|---|---|---|---|---|
| 0 | date_pre_seed | date_seed | 500 | 0.998 | 0.0 | 0.002 |
| 1 | date_pre_seed | date_series_a | 153 | 1.0 | 0.0 | 0.0 |
| 2 | date_pre_seed | date_series_b | 51 | 1.0 | 0.0 | 0.0 |
| 3 | date_pre_seed | date_series_c | 16 | 1.0 | 0.0 | 0.0 |
| 4 | date_seed | date_series_a | 975 | 0.998974 | 0.0 | 0.001026 |
| 5 | date_seed | date_series_b | 385 | 1.0 | 0.0 | 0.0 |
| 6 | date_seed | date_series_c | 124 | 1.0 | 0.0 | 0.0 |
| 7 | date_series_a | date_series_b | 567 | 1.0 | 0.0 | 0.0 |
| 8 | date_series_a | date_series_c | 209 | 1.0 | 0.0 | 0.0 |
| 9 | date_series_b | date_series_c | 225 | 1.0 | 0.0 | 0.0 |


## Time to First Million

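Compute a duration outcome, `time_to_1_mil`: the number of whole months between `founded_on` and `date_of_1_million`. Negative values mean the milestone date precedes the recorded founding date; the summary reports how many values are non-positive.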

- Adds `time_to_1_mil` (in months).
- Prints summary statistics and missingness by dataset.
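
The whole-month difference can be checked by hand on a few made-up date pairs. The sketch below uses the same arithmetic as the cell that follows: a calendar month difference, minus one when the end day falls before the start day.

```python
import pandas as pd

# Hypothetical founding and milestone dates.
start = pd.Series(pd.to_datetime(["2015-03-20", "2015-03-20", "2016-01-10"]))
end = pd.Series(pd.to_datetime(["2016-03-25", "2016-03-10", "2015-11-01"]))

# Calendar month difference...
months = (end.dt.year - start.dt.year) * 12 + (end.dt.month - start.dt.month)
# ...minus one if the final month is not yet complete.
months -= (end.dt.day < start.dt.day).astype(int)

print(months.tolist())  # [12, 11, -3]
```

The third pair shows how a milestone dated before the founding date produces a negative duration.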


In [10]:

# Purpose: Add `time_to_1_mil` (months) and print summary stats for diagnostics across datasets.
# NOTE: `globals().update(datasets)` exposes updated frames as top-level variables.

def add_time_to_1_mil(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    df["founded_on"] = pd.to_datetime(df["founded_on"], errors="coerce")
    df["date_of_1_million"] = pd.to_datetime(df.get("date_of_1_million"), errors="coerce")
    start, end = df["founded_on"], df["date_of_1_million"]
    months = (end.dt.year - start.dt.year) * 12 + (end.dt.month - start.dt.month)
    months -= (end.dt.day < start.dt.day).astype(int)
    df["time_to_1_mil"] = months
    return df


def summarize_time_to_1_mil(df: pd.DataFrame, name: str) -> None:
    s = df["time_to_1_mil"]
    desc = s.describe(percentiles=[0.1, 0.25, 0.5, 0.75, 0.9])
    print(f"{name}")
    print(desc[["count", "mean", "std", "min", "10%", "25%", "50%", "75%", "90%", "max"]])
    missing = s.isna().sum()
    print(f"missing: {missing} ({missing/len(df):.1%})")
    print(f"non-positive: {(s <= 0).sum()}")


for key, frame in datasets.items():
    datasets[key] = add_time_to_1_mil(frame)
    summarize_time_to_1_mil(datasets[key], key)

globals().update(datasets)


df_global
count    70990.000000
mean        48.296507
std         40.454913
min       -206.000000
10%          8.000000
25%         18.000000
50%         38.000000
75%         68.000000
90%        105.000000
max        222.000000
Name: time_to_1_mil, dtype: float64
missing: 1068728 (93.8%)
non-positive: 1867
df_uk
count    4814.000000
mean       52.741795
std        44.297265
min      -204.000000
10%         8.300000
25%        22.000000
50%        44.000000
75%        76.000000
90%       110.700000
max       427.000000
Name: time_to_1_mil, dtype: float64
missing: 78289 (94.2%)
non-positive: 175
df_usa
count    30904.000000
mean        41.897133
std         38.218641
min       -206.000000
10%          6.000000
25%         15.000000
50%         31.000000
75%         58.000000
90%         95.000000
max        221.000000
Name: time_to_1_mil, dtype: float64
missing: 386215 (92.6%)
non-positive: 948
df_uk_cb_only
count    4814.000000
mean       54.200665
std        41.156066
min      -204.0

## Sector Classification

Map Crunchbase `category_groups_list` into a consolidated, analysis-friendly `sector` variable.

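- Maps each category group of a company to a candidate sector via `CATEGORY_GROUP_TO_SECTOR`; sectors listed in `MERGE_TO_OTHER` are collapsed into `Other`.
- Chooses the primary sector as the most frequent candidate, breaking ties by the fixed order in `SECTOR_PRIORITY`. Companies with no mapped group get `Other`.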
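
A compact sketch of the frequency-then-priority rule. The mapping and priority list here are shortened, hypothetical stand-ins for the full dictionaries defined in the cell below, and `choose_sector` is an illustrative helper rather than the notebook's function.

```python
from collections import Counter

# Shortened, hypothetical mapping and priority order for illustration only.
GROUP_TO_SECTOR = {
    "Software": "Software & SaaS",
    "Apps": "Software & SaaS",
    "Payments": "Financial Services & Fintech",
    "Education": "Other",
}
PRIORITY = ["Software & SaaS", "Financial Services & Fintech", "Other"]


def choose_sector(groups: str) -> str:
    """Pick the most frequent mapped sector; break ties by PRIORITY order."""
    names = [g.strip() for g in groups.split(",") if g.strip()]
    counts = Counter(GROUP_TO_SECTOR[g] for g in names if g in GROUP_TO_SECTOR)
    if not counts:
        return "Other"
    top = max(counts.values())
    tied = [sector for sector, count in counts.items() if count == top]
    return next((s for s in PRIORITY if s in tied), tied[0])


print(choose_sector("Software,Apps,Payments"))  # Software & SaaS (2 votes vs 1)
print(choose_sector("Payments,Education"))      # tie, resolved by priority
```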


In [11]:

# Purpose: Map Crunchbase category groups to a single consolidated `sector` for analysis.
# NOTE: Ties are broken via a fixed priority list so results are deterministic.

from collections import Counter

CATEGORY_GROUP_TO_SECTOR = {
    "Administrative Services": "Professional & Business Services",
    "Advertising": "Marketing & Advertising",
    "Agriculture and Farming": "Food & Agriculture",
    "Apps": "Software & SaaS",
    "Artificial Intelligence (AI)": "IT & Data Infrastructure",
    "Biotechnology": "Health & Life Sciences",
    "Blockchain and Cryptocurrency": "Financial Services & Fintech",
    "Clothing and Apparel": "Retail & E-Commerce",
    "Commerce and Shopping": "Retail & E-Commerce",
    "Community and Lifestyle": "Sports & Lifestyle",
    "Consumer Electronics": "Hardware & Devices",
    "Consumer Goods": "Retail & E-Commerce",
    "Content and Publishing": "Media & Entertainment",
    "Data and Analytics": "IT & Data Infrastructure",
    "Design": "Professional & Business Services",
    "Education": "Education",
    "Energy": "Energy & Sustainability",
    "Events": "Marketing & Advertising",
    "Financial Services": "Financial Services & Fintech",
    "Food and Beverage": "Food & Agriculture",
    "Gaming": "Media & Entertainment",
    "Government and Military": "Government & Social Impact",
    "Hardware": "Hardware & Devices",
    "Health Care": "Health & Life Sciences",
    "Information Technology": "IT & Data Infrastructure",
    "Internet Services": "Software & SaaS",
    "Lending and Investments": "Financial Services & Fintech",
    "Manufacturing": "Industrial & Manufacturing",
    "Media and Entertainment": "Media & Entertainment",
    "Messaging and Telecommunications": "IT & Data Infrastructure",
    "Mobile": "Software & SaaS",
    "Music and Audio": "Media & Entertainment",
    "Natural Resources": "Energy & Sustainability",
    "Navigation and Mapping": "Transportation & Mobility",
    "Other": "Other",
    "Payments": "Financial Services & Fintech",
    "Platforms": "Software & SaaS",
    "Privacy and Security": "Security",
    "Professional Services": "Professional & Business Services",
    "Real Estate": "Real Estate & Construction",
    "Sales and Marketing": "Marketing & Advertising",
    "Science and Engineering": "Industrial & Manufacturing",
    "Social Impact": "Government & Social Impact",
    "Software": "Software & SaaS",
    "Sports": "Sports & Lifestyle",
    "Sustainability": "Energy & Sustainability",
    "Transportation": "Transportation & Mobility",
    "Travel and Tourism": "Travel & Hospitality",
    "Video": "Media & Entertainment",
}

MERGE_TO_OTHER = {
    "Education",
    "Other",
    "Transportation & Mobility",
    "Food & Agriculture",
    "Travel & Hospitality",
    "Sports & Lifestyle",
    "Government & Social Impact",
    "Security",
    "Industrial & Manufacturing",
    "Real Estate & Construction",
    "Hardware & Devices",
    "Energy & Sustainability",
}

SECTOR_PRIORITY = [
    "Software & SaaS",
    "IT & Data Infrastructure",
    "Hardware & Devices",
    "Financial Services & Fintech",
    "Health & Life Sciences",
    "Media & Entertainment",
    "Marketing & Advertising",
    "Retail & E-Commerce",
    "Industrial & Manufacturing",
    "Energy & Sustainability",
    "Real Estate & Construction",
    "Professional & Business Services",
    "Other",
]


def _split_list(value: str) -> list[str]:
    if not isinstance(value, str) or not value:
        return []
    return [item.strip() for item in value.split(",") if item.strip()]


def _candidate_sector_counts(row: pd.Series) -> Counter:
    counts = Counter()
    for group in _split_list(row.get("category_groups_list", "")):
        sector = CATEGORY_GROUP_TO_SECTOR.get(group)
        if sector:
            if sector in MERGE_TO_OTHER:
                sector = "Other"
            counts[sector] += 1
    return counts


def _choose_sector(counts: Counter) -> str:
    if not counts:
        return "Other"
    max_count = max(counts.values())
    candidates = [sector for sector, count in counts.items() if count == max_count]
    for preferred in SECTOR_PRIORITY:
        if preferred in candidates:
            return preferred
    return candidates[0]


def assign_primary_sector(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    df["sector"] = (
        df.apply(lambda row: _choose_sector(_candidate_sector_counts(row)), axis=1)
        .astype("category")
    )
    return df


def show_sector_share(df: pd.DataFrame, name: str) -> None:
    share = (
        df["sector"]
        .value_counts(normalize=True, dropna=False)
        .mul(100)
        .round(2)
    )
    print(f"{name}:")
    print(share.to_string())
    print()


for key, frame in datasets.items():
    datasets[key] = assign_primary_sector(frame)
    show_sector_share(datasets[key], key)

globals().update(datasets)


df_global:
sector
Other                               28.85
Software & SaaS                     22.29
Media & Entertainment                8.02
Financial Services & Fintech         7.71
Health & Life Sciences               6.86
Retail & E-Commerce                  6.68
IT & Data Infrastructure             6.67
Marketing & Advertising              6.48
Professional & Business Services     6.43

df_uk:
sector
Other                               27.23
Software & SaaS                     20.19
Media & Entertainment                9.75
Financial Services & Fintech         9.25
Professional & Business Services     8.23
Marketing & Advertising              7.32
IT & Data Infrastructure             6.43
Retail & E-Commerce                  6.20
Health & Life Sciences               5.42

df_usa:
sector
Other                               28.54
Software & SaaS                     19.54
Health & Life Sciences               9.99
Financial Services & Fintech         9.60
Media & Entertainment      

## Founding Cohort

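Bin the year of `founded_on` into ordered cohorts used as key stratification controls. Years outside the configured bin range (before 2007 or after 2017 with the active bins) yield a missing cohort.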

This produces:
- `found_year`: numeric year extracted from `founded_on`.
- `founding_cohort`: ordered categorical cohort label.
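
To see how the bin edges translate into cohorts, here is a sketch that applies `pd.cut` with the active bins and labels from `STUDY_PARAMS` to a handful of made-up years:

```python
import pandas as pd

bins = [2007, 2009, 2013, 2017]
labels = ["2007-2009", "2010-2013", "2014-2017"]

# Hypothetical founding years, including two outside the bin range.
years = pd.Series([2006, 2007, 2009, 2010, 2017, 2018])

# right=True closes each bin on the right; include_lowest=True keeps 2007 in the first bin.
cohort = pd.cut(years, bins=bins, labels=labels, right=True, include_lowest=True)
print(pd.DataFrame({"found_year": years, "founding_cohort": cohort}))
```

With these settings, 2007 and 2009 both land in `2007-2009`, while 2006 and 2018 fall outside every bin and come back as missing.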


In [12]:

# Purpose: Assign each company into an ordered founding cohort based on `founded_on` year.

covariate_cfg = STUDY_PARAMS["covariates"]
cohort_bins = covariate_cfg["founding_cohort_bins"]
cohort_labels = covariate_cfg["founding_cohort_labels"]
cohort_dtype = pd.CategoricalDtype(cohort_labels, ordered=True)


def assign_founding_cohort(df: pd.DataFrame) -> pd.DataFrame:
    if "founded_on" not in df.columns:
        raise KeyError("Missing required column 'founded_on' for cohort assignment.")

    out = df.copy()
    out["found_year"] = pd.to_datetime(out["founded_on"], errors="coerce").dt.year
    out["founding_cohort"] = pd.cut(
        out["found_year"],
        bins=cohort_bins,
        labels=cohort_labels,
        right=True,
        include_lowest=True,
    ).astype(cohort_dtype)
    return out


for key, frame in datasets.items():
    datasets[key] = assign_founding_cohort(frame)

globals().update(datasets)


## Founder Education

Collapse multiple founder degree indicators into a single ordered categorical feature `founder_education`.

The inference rule:
- If all degree flags are missing → `No higher education`.
- Otherwise, assign the highest-priority degree whose flag is truthy (PhD/JD → MBA → Master → Bachelor); a JD is labeled `PhD`.
- If at least one flag is recorded but none is truthy → `Other`.
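
The rule can be exercised on a toy frame. This is a simplified sketch that assumes numeric 0/1 flags (the notebook's helper also handles string values); `infer` and the toy rows are illustrative only.

```python
import pandas as pd

# Priority order of (label, column); a JD is folded into the "PhD" label.
PRIORITY = [
    ("PhD", "founders_has_phd"),
    ("PhD", "founders_has_jd"),
    ("MBA", "founders_has_mba"),
    ("Master", "founders_has_masters"),
    ("Bachelor", "founders_has_bachelors"),
]


def infer(row: pd.Series) -> str:
    flags = row[[col for _, col in PRIORITY]]
    if flags.isna().all():
        return "No higher education"   # no degree information recorded at all
    for label, col in PRIORITY:
        if pd.notna(row[col]) and bool(row[col]):
            return label               # first truthy flag in priority order
    return "Other"                     # flags recorded, none truthy


# Hypothetical rows: all missing, MBA + Bachelor, JD only, all zero.
toy = pd.DataFrame(
    {
        "founders_has_phd": [None, 0, 0, 0],
        "founders_has_jd": [None, 0, 1, 0],
        "founders_has_mba": [None, 1, 0, 0],
        "founders_has_masters": [None, 0, 0, 0],
        "founders_has_bachelors": [None, 1, 0, 0],
    },
    dtype="float",
)
result = toy.apply(infer, axis=1).tolist()
print(result)
```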


In [13]:
# Purpose: Infer a single `founder_education` category from multiple founder degree indicator columns.

degree_cols = [
    "founders_has_phd",
    "founders_has_jd",
    "founders_has_mba",
    "founders_has_masters",
    "founders_has_bachelors",
]

founder_edu_dtype = pd.CategoricalDtype(
    ["No higher education", "PhD", "MBA", "Master", "Bachelor", "Other"], ordered=True
)


def _is_truthy(val) -> bool:
    if pd.isna(val):
        return False
    if isinstance(val, str):
        normalized = val.strip().lower()
        if normalized in {"", "nan", "none"}:
            return False
        return normalized in {"1", "true", "yes", "y"}
    return bool(val)


def infer_founder_edu(row: pd.Series) -> object:
    values = row[degree_cols]
    if values.isna().all():
        return "No higher education"

    for label, col in [
        ("PhD", "founders_has_phd"),
        ("PhD", "founders_has_jd"),
        ("MBA", "founders_has_mba"),
        ("Master", "founders_has_masters"),
        ("Bachelor", "founders_has_bachelors"),
    ]:
        val = row[col]
        if pd.isna(val):
            continue
        if _is_truthy(val):
            return label

    return "Other"


def assign_founder_education(df: pd.DataFrame) -> pd.DataFrame:
    missing = [col for col in degree_cols if col not in df.columns]
    if missing:
        raise KeyError(f"Missing required columns: {missing}")

    result = df.copy()
    result["founder_education"] = (
        result.apply(infer_founder_edu, axis=1).astype(founder_edu_dtype)
    )
    return result


def show_founder_education_pct(df: pd.DataFrame, name: str) -> None:
    if "founder_education" not in df.columns:
        print(f"{name}: missing founder_education column")
        return

    pct = (
        df["founder_education"]
        .value_counts(normalize=True, dropna=False)
        .mul(100)
        .sort_values(ascending=False)
        .round(2)
    )

    print(f"{name}:")
    print(pct.to_string())
    print()


for key, frame in datasets.items():
    datasets[key] = assign_founder_education(frame)
    show_founder_education_pct(datasets[key], key)

globals().update(datasets)


df_global:
founder_education
No higher education    85.83
Other                   5.20
Bachelor                4.09
Master                  2.27
MBA                     1.79
PhD                     0.81

df_uk:
founder_education
No higher education    86.99
Other                   5.40
Bachelor                3.49
Master                  2.31
MBA                     1.15
PhD                     0.67

df_usa:
founder_education
No higher education    80.77
Bachelor                6.75
Other                   6.17
MBA                     2.52
Master                  2.49
PhD                     1.30

df_uk_cb_only:
founder_education
No higher education    86.99
Other                   5.40
Bachelor                3.49
Master                  2.31
MBA                     1.15
PhD                     0.67



## Team Size and Diversity

Construct founding-team composition covariates:

- `founding_team_size`: binned from `founders_count` into `1`, `2`, `3+`.
- `founding_team_diversity`: categorical label combining gender diversity and nationality diversity.
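
The diversity label combines two yes/no signals. The scalar sketch below assumes clean integer counts and a pipe-separated country string, as in the cell further down; `diversity_label` is a hypothetical helper that leaves out the missing-value handling.

```python
def diversity_label(female: int, male: int, countries: str) -> str:
    """Combine gender and nationality diversity into a single label."""
    if female + male == 1:
        return "solo founder"
    n_countries = len({c.strip() for c in countries.split("|") if c.strip()})
    gender_diverse = female >= 1 and male >= 1
    nationality_diverse = n_countries >= 2
    if gender_diverse and nationality_diverse:
        return "diverse in founder genders and in nationalities"
    if gender_diverse:
        return "diverse in founder genders, not in nationalities"
    if nationality_diverse:
        return "diverse in nationalities, not in founder genders"
    return "not diverse"


# Hypothetical teams.
print(diversity_label(1, 0, "GBR"))      # solo founder
print(diversity_label(1, 1, "GBR|USA"))  # diverse on both dimensions
print(diversity_label(0, 2, "GBR|GBR"))  # not diverse
```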


In [14]:

# Purpose: Bin `founders_count` into an ordered founding team size category (1, 2, 3+).

TEAM_SIZE_CATEGORIES = ["1", "2", "3+"]
team_size_dtype = pd.CategoricalDtype(TEAM_SIZE_CATEGORIES, ordered=True)


def bin_team_size(value) -> object:
    if pd.isna(value):
        return pd.NA
    try:
        size = int(value)
    except (TypeError, ValueError):
        return pd.NA
    if size <= 0:
        return pd.NA
    return "3+" if size >= 3 else str(size)


def apply_team_size_binning(df: pd.DataFrame) -> pd.DataFrame:
    if "founders_count" not in df.columns:
        raise KeyError("Missing founders_count column required for founding_team_size.")
    result = df.copy()
    result["founding_team_size"] = (
        result["founders_count"]
        .apply(bin_team_size)
        .astype(team_size_dtype)
    )
    return result


def show_team_size_pct(df: pd.DataFrame, name: str) -> None:
    if "founding_team_size" not in df.columns:
        print(f"{name}: missing founding_team_size column")
        return

    pct = (
        df["founding_team_size"]
        .astype("object")
        .value_counts(dropna=False, normalize=True)
        .mul(100)
        .round(2)
    )
    pct.index = pct.index.fillna("NaN")
    pct = pct.reindex(["NaN"] + TEAM_SIZE_CATEGORIES, fill_value=0)

    print(f"{name}:")
    print(pct.to_string())
    print()


for key, frame in datasets.items():
    datasets[key] = apply_team_size_binning(frame)
    show_team_size_pct(datasets[key], key)

globals().update(datasets)


df_global:
founding_team_size
NaN    74.51
1      17.46
2       5.75
3+      2.28

df_uk:
founding_team_size
NaN    74.93
1      17.38
2       5.88
3+      1.81

df_usa:
founding_team_size
NaN    69.96
1      20.29
2       7.00
3+      2.75

df_uk_cb_only:
founding_team_size
NaN    74.92
1      17.39
2       5.88
3+      1.81



In [15]:

# Purpose: Construct a founding-team diversity label combining gender and nationality diversity signals.

DIVERSITY_LABELS = [
    "solo founder",
    "not diverse",
    "diverse in founder genders, not in nationalities",
    "diverse in founder genders and in nationalities",
    "diverse in nationalities, not in founder genders",
]
diversity_dtype = pd.CategoricalDtype(DIVERSITY_LABELS, ordered=False)


def _split_countries(value) -> list[str]:
    if isinstance(value, str):
        tokens = [part.strip() for part in value.split("|") if part.strip()]
        return [tok for tok in tokens if tok.lower() not in {"nan", "none"}]
    if isinstance(value, (list, tuple, set)):
        return [str(part).strip() for part in value if str(part).strip()]
    return []


def assign_founding_team_diversity(df: pd.DataFrame) -> pd.DataFrame:
    required_cols = {"founders_countries", "founders_female_count", "founders_male_count"}
    missing = required_cols - set(df.columns)
    if missing:
        raise KeyError(f"Missing required columns: {sorted(missing)}")

    result = df.copy()

    result["_female_count"] = pd.to_numeric(result["founders_female_count"], errors="coerce")
    result["_male_count"] = pd.to_numeric(result["founders_male_count"], errors="coerce")
    result["_female_count"] = result["_female_count"].mask(result["_female_count"] < 0)
    result["_male_count"] = result["_male_count"].mask(result["_male_count"] < 0)

    def calc_diversity(row: pd.Series):
        if pd.isna(row["founders_countries"]) or pd.isna(row["_female_count"]) or pd.isna(row["_male_count"]):
            return pd.NA

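        total_founders = row["_female_count"] + row["_male_count"]
        # NOTE: only a total of exactly 1 is treated as solo; a total of 0 (both
        # counts zero) falls through to the team branches below.
        if total_founders == 1:
            return "solo founder"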

        unique_countries = len(set(_split_countries(row["founders_countries"])))
        gender_diverse = (row["_female_count"] >= 1) and (row["_male_count"] >= 1)

        if unique_countries < 2 and not gender_diverse:
            return "not diverse"
        if unique_countries < 2 and gender_diverse:
            return "diverse in founder genders, not in nationalities"
        if unique_countries >= 2 and gender_diverse:
            return "diverse in founder genders and in nationalities"
        return "diverse in nationalities, not in founder genders"

    diversity = result.apply(calc_diversity, axis=1)
    result["founding_team_diversity"] = pd.Categorical(diversity, categories=DIVERSITY_LABELS, ordered=False)

    result.drop(columns=["_female_count", "_male_count"], inplace=True)
    return result


def show_diversity_pct(df: pd.DataFrame, name: str) -> None:
    if "founding_team_diversity" not in df.columns:
        print(f"{name}: missing founding_team_diversity column")
        return

    pct = (
        df["founding_team_diversity"]
        .astype("object")
        .value_counts(dropna=False, normalize=True)
        .mul(100)
        .round(2)
    )
    pct.index = pct.index.fillna("NaN")
    pct = pct.reindex(["NaN"] + DIVERSITY_LABELS, fill_value=0)

    print(f"{name}:")
    print(pct.to_string())
    print()


for key, frame in datasets.items():
    datasets[key] = assign_founding_team_diversity(frame)
    show_diversity_pct(datasets[key], key)

globals().update(datasets)


df_global:
founding_team_diversity
NaN                                                 76.88
solo founder                                        15.34
not diverse                                          5.17
diverse in founder genders, not in nationalities     1.42
diverse in founder genders and in nationalities      0.24
diverse in nationalities, not in founder genders     0.94

df_uk:
founding_team_diversity
NaN                                                 76.60
solo founder                                        15.94
not diverse                                          4.51
diverse in founder genders, not in nationalities     1.33
diverse in founder genders and in nationalities      0.31
diverse in nationalities, not in founder genders     1.31

df_usa:
founding_team_diversity
NaN                                                 73.02
solo founder                                        17.63
not diverse                                          6.38
diverse in founder genders, no

## Parent Organization Backing

(Disabled / archived feature)

This section contains an earlier attempt at constructing a `parent_backed` indicator using an org-parent mapping file. The code is intentionally kept (commented) for provenance and possible later revival.


In [16]:

# Purpose: (Commented-out) Archived parent-organization backing feature; kept for provenance.
# NOTE: This block is intentionally not executed in the current pipeline.

# from pathlib import Path

# parent_backed_dtype = pd.CategoricalDtype(["Yes"], ordered=False)

# PARENT_MAP_PATH = Path("/Users/stefan/Desktop/Thesis/v4/Crunchbase Data/bulk_export/org_parents.csv")
# parent_lookup = (
#     pd.read_csv(PARENT_MAP_PATH, usecols=["parent_uuid", "parent_name"])
#       .dropna(subset=["parent_uuid", "parent_name"])
#       .drop_duplicates(subset=["parent_uuid"])
#       .set_index("parent_uuid")["parent_name"]
# )


# def assign_parent_backed(df: pd.DataFrame) -> pd.DataFrame:
#     required = {"parent_uuid", "acquired_on_first"}
#     missing = required.difference(df.columns)
#     if missing:
#         raise KeyError(f"Missing required columns for parent_backed: {missing}")

#     result = df.copy()

#     parent_uuid = result["parent_uuid"].astype("string")
#     result["parent_name"] = parent_uuid.map(parent_lookup).astype("string")

#     parent_name = result["parent_name"].str.strip()
#     has_parent_name = parent_name.notna() & parent_name.ne("")

#     acquired_first = result["acquired_on_first"].astype("string").str.strip()
#     acquired_missing = acquired_first.isna() | acquired_first.eq("")

#     mask = has_parent_name & acquired_missing

#     result["parent_backed"] = pd.Series(pd.NA, index=result.index, dtype="string")
#     result.loc[mask, "parent_backed"] = "Yes"
#     result["parent_backed"] = result["parent_backed"].astype(parent_backed_dtype)

#     return result


# def show_parent_backed_pct(df: pd.DataFrame, name: str) -> None:
#     if "parent_backed" not in df.columns:
#         print(f"{name}: missing parent_backed column\n")
#         return

#     pct = (
#         df["parent_backed"]
#         .value_counts(normalize=True, dropna=False)
#         .mul(100)
#         .round(2)
#     )
#     pct.index = pct.index.fillna("NaN")
#     pct = pct.sort_index()

#     print(f"{name}:")
#     print(pct.to_string())
#     print()


# for key, frame in datasets.items():
#     datasets[key] = assign_parent_backed(frame)
#     show_parent_backed_pct(datasets[key], key)

# globals().update(datasets)


## Founder University Reputation

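Extract university names from the raw `founders_degrees` text field and map each company to its best-matching reputation bucket, in priority order: Top UK, Top US, FT top business school, THE top 200, Other. Companies with no extracted institution and a `founder_education` value of "No higher education" are labelled accordingly.

Sources used:
- Hand-curated Top UK and Top US lists defined in the cell below (checked first).
- Financial Times (FT) business-school ranking.
- Times Higher Education (THE) top-200 list.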

The feature is designed to be robust to spelling variants and common aliases.
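
A minimal sketch of the alias-normalization idea, assuming a two-entry alias table and a made-up degree string in an `@ <institution>` layout. `DEMO_ALIASES`, `normalize_key`, `to_canonical`, and the sample text are hypothetical names for illustration; the real alias table and extraction patterns live in the cell that follows.

```python
import re

# Hypothetical alias table; keys are lower-cased with punctuation stripped.
DEMO_ALIASES = {
    "mit": "Massachusetts Institute of Technology",
    "kings college london": "King's College London",
}


def normalize_key(name: str) -> str:
    """Lower-case, drop punctuation, and collapse whitespace."""
    key = re.sub(r"[^a-z0-9 ]+", "", name.lower())
    return re.sub(r"\s+", " ", key).strip()


def to_canonical(name: str) -> str:
    """Return the canonical spelling when an alias matches, else the cleaned input."""
    cleaned = re.sub(r"\s+", " ", name).strip()
    return DEMO_ALIASES.get(normalize_key(cleaned), cleaned)


# Hypothetical raw degree text (not taken from the dataset).
sample = "BSc @ MIT | MA @ King's  College London | MBA @ Example Business School"
found = [to_canonical(m.group(1)) for m in re.finditer(r"@ ([^()|]+)", sample)]
print(found)
```

Matching on a punctuation-free key means variants such as a missing apostrophe or doubled spaces resolve to one canonical spelling, while unknown names pass through unchanged and end up in the "Other" bucket.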


In [17]:
# Purpose: Build `founder_uni_reputation` by extracting institutions from raw degree text and ranking them.
# NOTE: Uses THE/FT files plus aliasing to normalize common spelling variants.

import re
from pathlib import Path
from typing import Optional

# --- University reputation feature -----------------------------------------
UNI_RANKINGS_DIR = Path("/Users/stefan/Desktop/Thesis/v2/Uni Rankings")
THE_TOP200_PATH = UNI_RANKINGS_DIR / "Top_Universities_THE.xlsx"
FT_RANKING_PATH = UNI_RANKINGS_DIR / "FT_uni_ranking.xlsx"

for required_path in (THE_TOP200_PATH, FT_RANKING_PATH):
    if not required_path.exists():
        raise FileNotFoundError(f"Missing ranking file at {required_path}")

def clean_spaces(value: str) -> str:
    return re.sub(r"\s+", " ", str(value)).strip()

def canon_key(value: str) -> str:
    key = re.sub(r"[^a-z0-9 ]+", "", value.lower())
    return re.sub(r"\s+", " ", key).strip()

ALIASES = {
    "uc berkeley": "University of California, Berkeley",
    "university of california berkeley": "University of California, Berkeley",
    "ucla": "University of California, Los Angeles",
    "university of california los angeles": "University of California, Los Angeles",
    "mit": "Massachusetts Institute of Technology",
    "massachusetts institute of technology mit": "Massachusetts Institute of Technology",
    "caltech": "California Institute of Technology",
    "lse": "London School of Economics and Political Science",
    "london school of economics": "London School of Economics and Political Science",
    "university of michigan ann arbor": "University of Michigan, Ann Arbor",
    "university of michiganann arbor": "University of Michigan, Ann Arbor",
    "u michigan": "University of Michigan, Ann Arbor",
    "ucl": "UCL",
    "university college london": "UCL",
    "kings college london": "King's College London",
}

def canonicalize(name: str) -> str:
    base = clean_spaces(name)
    return ALIASES.get(canon_key(base), base)

PAREN_FIRST_FIELD = re.compile(r"\(([^,()]+),")
AT_UNI_PATTERN = re.compile(r"@ ([^()|]+)")

def extract_institutions(text) -> list[str]:
    if pd.isna(text):
        return []
    institutions = [match.group(1).strip() for match in PAREN_FIRST_FIELD.finditer(str(text))]
    institutions.extend(match.group(1).strip() for match in AT_UNI_PATTERN.finditer(str(text)))
    unique, seen = [], set()
    for inst in institutions:
        canonical = canonicalize(inst)
        if canonical and canonical not in seen:
            unique.append(canonical)
            seen.add(canonical)
    return unique

IVY_LEAGUE = {
    "Brown University", "Columbia University", "Cornell University", "Dartmouth College",
    "Harvard University", "University of Pennsylvania", "Princeton University", "Yale University",
}
TOP_US_EXTRA = {
    "Stanford University", "Massachusetts Institute of Technology", "University of Chicago",
    "Northwestern University", "Duke University", "Johns Hopkins University",
    "California Institute of Technology", "University of California, Berkeley",
    "University of California, Los Angeles", "University of Michigan, Ann Arbor",
}
TOP_US_UNIS = {canonicalize(name) for name in (IVY_LEAGUE | TOP_US_EXTRA)}

TOP_UK_UNIS = {
    canonicalize(name)
    for name in {
        "University of Oxford",
        "University of Cambridge",
        "Imperial College London",
        "UCL",
        "University of Edinburgh",
        "King's College London",
        "London School of Economics and Political Science",
        "University of Manchester",
        "University of Bristol",
        "University of Glasgow",
        "University of Birmingham",
        "University of Sheffield",
        "University of Leeds",
        "University of Warwick",
        "University of Southampton",
        "Queen Mary University of London",
        "University of Liverpool",
        "Newcastle University",
        "University of Nottingham",
        "University of York",
    }
}

def _standardize_columns(df: pd.DataFrame) -> pd.DataFrame:
    renamed = df.copy()
    renamed.columns = [col.strip().lower() for col in renamed.columns]
    return renamed

def _read_excel(path: Path) -> pd.DataFrame:
    return pd.read_excel(path)

def load_the_top200(path: Path) -> pd.DataFrame:
    the_df = _standardize_columns(_read_excel(path))
    name_col = next(
        (col for col in ["institution", "university", "university name", "name", "school", "school name"] if col in the_df.columns),
        next((col for col in the_df.columns if "name" in col), None),
    )
    if name_col is None:
        raise ValueError("THE file: could not find a column containing university names.")
    rank_col = next((col for col in the_df.columns if "rank" in col), None)
    if rank_col is None:
        the_df["rank"] = pd.NA
        rank_col = "rank"
    the_df = the_df[[name_col, rank_col]].rename(columns={name_col: "institution", rank_col: "rank"})
    the_df["institution_can"] = the_df["institution"].astype(str).apply(canonicalize)
    the_df["rank_num"] = pd.to_numeric(the_df["rank"], errors="coerce")
    return the_df

def load_ft_ranking(path: Path) -> pd.DataFrame:
    ft_df = _standardize_columns(_read_excel(path))
    name_col = next(
        (col for col in ["school name", "institution", "university", "school", "name"] if col in ft_df.columns),
        next((col for col in ft_df.columns if "name" in col), None),
    )
    if name_col is None:
        raise ValueError("FT file: could not find a column containing school names.")
    rank_col = next((col for col in ft_df.columns if "rank" in col), None)
    if rank_col is None:
        ft_df["rank"] = pd.NA
        rank_col = "rank"
    ft_df = ft_df[[name_col, rank_col]].rename(columns={name_col: "institution", rank_col: "rank"})
    ft_df["institution_can"] = ft_df["institution"].astype(str).apply(canonicalize)
    ft_df["rank_num"] = pd.to_numeric(ft_df["rank"], errors="coerce")
    return ft_df

the_top200 = load_the_top200(THE_TOP200_PATH)
ft_rank = load_ft_ranking(FT_RANKING_PATH)

THE_SET = set(the_top200["institution_can"])
FT_SET = set(ft_rank["institution_can"])
THE_RANK_MAP = dict(zip(the_top200["institution_can"], the_top200["rank_num"]))
FT_RANK_MAP = dict(zip(ft_rank["institution_can"], ft_rank["rank_num"]))

PRIORITY_ORDER = {"top_uk": 1, "top_us": 2, "ft_top": 3, "top200_the": 4, "other": 5}

def classify_institution(name: str):
    canonical = canonicalize(name)
    if canonical in TOP_UK_UNIS:
        return ("top_uk", PRIORITY_ORDER["top_uk"], 0.0)
    if canonical in TOP_US_UNIS:
        return ("top_us", PRIORITY_ORDER["top_us"], 0.0)
    if canonical in FT_SET:
        rank = FT_RANK_MAP.get(canonical, float("inf"))
        rank = rank if pd.notna(rank) else float("inf")
        return ("ft_top", PRIORITY_ORDER["ft_top"], rank)
    if canonical in THE_SET:
        rank = THE_RANK_MAP.get(canonical, float("inf"))
        rank = rank if pd.notna(rank) else float("inf")
        return ("top200_the", PRIORITY_ORDER["top200_the"], rank)
    return ("other", PRIORITY_ORDER["other"], float("inf"))

def choose_best(institutions: list[str]) -> Optional[str]:
    if not institutions:
        return None
    best_key, best_category = None, "other"
    for institution in institutions:
        category, priority, rank = classify_institution(institution)
        key = (priority, rank, canonicalize(institution))
        if best_key is None or key < best_key:
            best_key, best_category = key, category
    return best_category

LABEL_MAP = {
    "top_uk": "Top UK",
    "top_us": "Top US",
    "ft_top": "FT top business school",
    "top200_the": "THE top 200",
    "other": "Other",
}
FOUNDER_UNI_LABELS = ["Top UK", "Top US", "FT top business school", "THE top 200", "Other", "No higher education"]
founder_uni_dtype = pd.CategoricalDtype(FOUNDER_UNI_LABELS, ordered=True)

def assign_founder_uni_reputation(df: pd.DataFrame) -> pd.DataFrame:
    if "founders_degrees" not in df.columns:
        raise KeyError("Missing required column 'founders_degrees' for university reputation.")
    result = df.copy()
    institutions_series = result["founders_degrees"].apply(extract_institutions)
    best_categories = institutions_series.apply(choose_best)
    founder_uni_reputation = best_categories.map(LABEL_MAP)
    if "founder_education" in result.columns:
        no_uni_mask = best_categories.isna()
        founder_education_norm = (
            result["founder_education"]
            .astype("string")
            .str.strip()
            .str.casefold()
        )
        no_edu_mask = founder_education_norm == "no higher education"
        founder_uni_reputation.loc[no_uni_mask & no_edu_mask] = "No higher education"
    founder_uni_reputation = founder_uni_reputation.where(pd.notna(founder_uni_reputation), pd.NA)
    result["founder_uni_reputation"] = pd.Categorical(
        founder_uni_reputation, dtype=founder_uni_dtype
    )
    return result

dfs_to_process = [df_global, df_uk, df_usa, df_uk_cb_only]

processed_dfs = [assign_founder_uni_reputation(df) for df in dfs_to_process]

df_global, df_uk, df_usa, df_uk_cb_only = processed_dfs


In [18]:
# Purpose: Quick sanity-check of the global frame after upstream feature merges.

df_global.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1139718 entries, 0 to 1139717
Columns: 106 entries, org_uuid to founder_uni_reputation
dtypes: category(6), datetime64[ns](2), float64(16), int32(1), object(81)
memory usage: 871.7+ MB


## Prior Founding Experience

Merge an externally built `serial_founders.csv` flag onto each dataset via `org_uuid`.

Produces:
- `prior_founding_experience` as a 0/1 indicator.
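
The merge can be sketched on toy data as follows. This assumes the flag file may hold several rows per company (one per founder), which is why the flags are collapsed with `any()` before merging; the frames `companies` and `founder_flags` and their values are hypothetical.

```python
import pandas as pd

# Hypothetical company frame and founder-level flag table.
companies = pd.DataFrame({"org_uuid": ["a", "b", "c"]})
founder_flags = pd.DataFrame(
    {"org_uuid": ["a", "a", "b"], "had_prior_founder": [False, True, False]}
)

# Collapse founder rows to one row per company: 1 if any founder carries the flag.
flag = (
    founder_flags.groupby("org_uuid")["had_prior_founder"]
    .any()
    .astype("int8")
    .rename("prior_founding_experience")
    .reset_index()
)

# Left merge keeps every company; companies without a match get 0.
merged = companies.merge(flag, on="org_uuid", how="left", validate="one_to_one")
merged["prior_founding_experience"] = (
    merged["prior_founding_experience"].fillna(0).astype("int8")
)
print(merged)
```

Filling unmatched companies with 0 treats "not in the file" the same as "no prior founding experience", so the indicator is best read as evidence of experience rather than proof of its absence.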


In [19]:

# Purpose: Merge prior founding experience (serial founder flag) onto each dataset via `org_uuid`.
# NOTE: Missing flags are treated as 0 (no evidence of prior founding experience).

datasets = {
    "df_global": df_global,
    "df_uk": df_uk,
    "df_usa": df_usa,
    "df_uk_cb_only": df_uk_cb_only,
}

SERIAL_FOUNDERS_PATH = Path("/Users/stefan/Desktop/Thesis/v4/Serial Founders Data/serial_founders.csv")
TRUE_VALUES = {"true"}

serial_founders = pd.read_csv(
    SERIAL_FOUNDERS_PATH,
    usecols=["org_uuid", "had_prior_founder"],
    dtype={"org_uuid": "string"},
    low_memory=False,
)

serial_founders["had_prior_founder"] = (
    serial_founders["had_prior_founder"]
    .astype(str)
    .str.strip()
    .str.lower()
    .isin(TRUE_VALUES)
)

prior_flag = (
    serial_founders.dropna(subset=["org_uuid"])
    .groupby("org_uuid", dropna=True)["had_prior_founder"]
    .any()
    .astype("int8")
    .rename("prior_founding_experience")
    .to_frame()
)

def assign_prior_founding_experience(df: pd.DataFrame) -> pd.DataFrame:
    if "org_uuid" not in df.columns:
        raise KeyError("Missing required column 'org_uuid' for prior founding experience.")
    result = df.copy()
    result = result.merge(prior_flag, on="org_uuid", how="left", validate="one_to_one")
    result["prior_founding_experience"] = result["prior_founding_experience"].fillna(0).astype("int8")
    return result

# If `founder_uni_reputation` is missing from a frame, rerun the uni-reputation cell first, then rerun this cell.

for key, frame in datasets.items():
    datasets[key] = assign_prior_founding_experience(frame)

globals().update(datasets)


## Company-Level Feature Engineering → Analysis Subsamples

Up to this point, features are added at the company level (`df_global`, `df_uk`, `df_usa`, …). Next, we create *analysis subsets* and *round cohorts* that are used to estimate outcome models.


# Build Filtered Analysis Frames (“Apples”)

Create cleaner, more comparable company-level frames by filtering out problematic observations.

Motivation:
- Reduce missingness in core covariates.
- Remove logically inconsistent closure events (e.g., closed on or before the first funding date, or status `closed` with no `closed_on` date).

Outputs: `df_global_apples`, `df_uk_apples`, `df_uk_cb_only_apples`, `df_usa_apples`.
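
A minimal sketch of the sequential filtering pattern, assuming a toy frame with made-up values. The column names mirror the notebook, but `demo`, `closure_is_invalid`, and `steps` are illustrative names only. Each rule returns a boolean mask of rows to drop, and the per-rule count is taken on the rows that survived the earlier rules.

```python
import pandas as pd

# Hypothetical frame; values are made up.
demo = pd.DataFrame(
    {
        "status": ["operating", "closed", "closed", None],
        "closed_on": [None, "2015-06-01", None, None],
        "first_funding_date": ["2014-01-01", "2016-01-01", "2014-01-01", "2014-01-01"],
    }
)


def closure_is_invalid(frame: pd.DataFrame) -> pd.Series:
    """Flag closed companies with no closure date, or closed on/before first funding."""
    closed = pd.to_datetime(frame["closed_on"], errors="coerce")
    funded = pd.to_datetime(frame["first_funding_date"], errors="coerce")
    is_closed = frame["status"].astype(str).str.strip().str.lower().eq("closed")
    before_funding = closed.notna() & funded.notna() & (closed <= funded)
    return (is_closed & closed.isna()) | before_funding


# Rules are applied in insertion order; each sees only the rows still kept.
steps = {
    "status": lambda frame: frame["status"].isna(),
    "closed_on": closure_is_invalid,
}

kept, removed = demo.copy(), {}
for label, rule in steps.items():
    mask = rule(kept)
    removed[label] = int(mask.sum())
    kept = kept.loc[~mask].copy()

print(removed, len(kept))
```

Because the rules run in sequence, a row that fails several rules is counted only under the first one that removes it, so the per-column counts sum to the total removed.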


In [20]:
# Purpose: (Commented-out) Older, stricter “apples” filter variant retained for reference.

# BAD_ARCHETYPES = {"INSTITUTIONAL", "PRIVATE_EQUITY"}


# def _invalid_closed(frame: pd.DataFrame) -> pd.Series:
#     closed_dt = pd.to_datetime(frame["closed_on"], errors="coerce")
#     funding_dt = pd.to_datetime(frame["first_funding_date"], errors="coerce")
#     status_closed = frame["status"].astype(str).str.strip().str.lower().eq("closed")
#     missing_closed = status_closed & frame["closed_on"].isna()
#     closed_before_funding = (
#         closed_dt.notna() & funding_dt.notna() & (closed_dt <= funding_dt)
#     )
#     return missing_closed | closed_before_funding


# def build_apples(df: pd.DataFrame, name: str) -> pd.DataFrame:
#     required_cols = [
#         "being_funded",
#         "ttf_months",
#         "status",
#         "sector",
#         "founding_cohort",
#         "closed_on",
#         "first_funding_date",
#         # "first_funding_raised_usd",
#         "first_funding_archetype",
#         "founder_education",
#         "top_investor",
#     ]
#     missing = set(required_cols) - set(df.columns)
#     if missing:
#         raise KeyError(f"{name}: missing required columns {missing}")

#     filter_steps = {
#         "being_funded": lambda frame: pd.to_numeric(frame["being_funded"], errors="coerce").ne(1),
#         "ttf_months": lambda frame: pd.to_numeric(frame["ttf_months"], errors="coerce").gt(60),
#         "status": lambda frame: frame["status"].isna() | frame["status"].astype(str).str.strip().eq(""),
#         "sector": lambda frame: frame["sector"].isna() | frame["sector"].astype(str).str.strip().eq(""),
#         "founding_cohort": lambda frame: frame["founding_cohort"].isna()
#         | frame["founding_cohort"].astype(str).str.strip().eq(""),
#         "closed_on": _invalid_closed,
#         "first_funding_date": lambda frame: frame["first_funding_date"].isna() | frame["first_funding_date"].astype(str).str.strip().eq(""),
#         # "first_funding_raised_usd": lambda frame: frame["first_funding_raised_usd"].isna() | frame["first_funding_raised_usd"].astype(str).str.strip().eq(""),
#         "first_funding_archetype": lambda frame: (
#             frame["first_funding_archetype"].isin(BAD_ARCHETYPES)
#             | frame["first_funding_archetype"].isna()
#         ),
#         "founder_education": lambda frame: frame["founder_education"].isna()
#         | frame["founder_education"].astype(str).str.strip().eq(""),
#         "top_investor": lambda frame: frame["top_investor"].isna()
#         | frame["top_investor"].astype(str).str.strip().eq(""),
#     }

#     kept = df.copy()
#     removed_by_col = {}
#     for col in required_cols:
#         mask = filter_steps[col](kept)
#         removed_by_col[col] = int(mask.sum())
#         kept = kept.loc[~mask].copy()

#     total_removed = len(df) - len(kept)
#     print(
#         f"{name}: removed {total_removed:,} rows "
#         + ", ".join(f"{col}={count:,}" for col, count in removed_by_col.items())
#         + f"; kept {len(kept):,}"
#     )
#     return kept


# apples = {
#     "df_global_apples": build_apples(datasets["df_global"], "df_global"),
#     "df_uk_apples": build_apples(datasets["df_uk"], "df_uk"),
#     "df_uk_cb_only_apples": build_apples(datasets["df_uk_cb_only"], "df_uk_cb_only"),
#     "df_usa_apples": build_apples(datasets["df_usa"], "df_usa"),
# }

# globals().update(apples)


In [21]:

# Purpose: Build “apples” analysis frames by filtering missing/invalid core covariates and impossible closures.

def _invalid_closed(frame: pd.DataFrame) -> pd.Series:
    closed_dt = pd.to_datetime(frame["closed_on"], errors="coerce")
    funding_dt = pd.to_datetime(frame["first_funding_date"], errors="coerce")
    status_closed = frame["status"].astype(str).str.strip().str.lower().eq("closed")
    missing_closed = status_closed & frame["closed_on"].isna()
    closed_before_funding = (
        closed_dt.notna() & funding_dt.notna() & (closed_dt <= funding_dt)
    )
    return missing_closed | closed_before_funding


def build_apples(df: pd.DataFrame, name: str) -> pd.DataFrame:
    required_cols = [
        "status",
        "sector",
        "founding_cohort",
        "closed_on",
        # "founder_education",  # disabled: its rule stays in filter_steps but is not applied
    ]
    missing = set(required_cols) - set(df.columns)
    if missing:
        raise KeyError(f"{name}: missing required columns {missing}")

    filter_steps = {
        "status": lambda frame: frame["status"].isna() | frame["status"].astype(str).str.strip().eq(""),
        "sector": lambda frame: frame["sector"].isna() | frame["sector"].astype(str).str.strip().eq(""),
        "founding_cohort": lambda frame: frame["founding_cohort"].isna()
        | frame["founding_cohort"].astype(str).str.strip().eq(""),
        "closed_on": _invalid_closed,
        "founder_education": lambda frame: frame["founder_education"].isna()
        | frame["founder_education"].astype(str).str.strip().eq(""),
    }

    kept = df.copy()
    removed_by_col = {}
    for col in required_cols:
        mask = filter_steps[col](kept)
        removed_by_col[col] = int(mask.sum())
        kept = kept.loc[~mask].copy()

    total_removed = len(df) - len(kept)
    print(
        f"{name}: removed {total_removed:,} rows "
        + ", ".join(f"{col}={count:,}" for col, count in removed_by_col.items())
        + f"; kept {len(kept):,}"
    )
    return kept


apples = {
    "df_global_apples": build_apples(datasets["df_global"], "df_global"),
    "df_uk_apples": build_apples(datasets["df_uk"], "df_uk"),
    "df_uk_cb_only_apples": build_apples(datasets["df_uk_cb_only"], "df_uk_cb_only"),
    "df_usa_apples": build_apples(datasets["df_usa"], "df_usa"),
}

globals().update(apples)


df_global: removed 51,446 rows status=0, sector=0, founding_cohort=0, closed_on=51,446; kept 1,088,272
df_uk: removed 2,338 rows status=0, sector=0, founding_cohort=225, closed_on=2,113; kept 80,765
df_uk_cb_only: removed 2,111 rows status=0, sector=0, founding_cohort=0, closed_on=2,111; kept 80,951
df_usa: removed 15,604 rows status=0, sector=0, founding_cohort=0, closed_on=15,604; kept 401,515


# Round Separation

Create *round-specific* subsets (pre-seed, seed, series A, seed→A, and angel) from each base dataset.

These subsets are stored in the dictionary `dfs` and are later augmented with time-to-funding, investor, and macro features.
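
The subsetting rule can be sketched on toy data: a company belongs to a round subset when the corresponding date column is populated. The frame `base` and its values are hypothetical; only the `date_seed` / `date_series_a` column names come from the notebook.

```python
import pandas as pd

# Hypothetical base frame; NaT means the round was not observed.
base = pd.DataFrame(
    {
        "org_uuid": ["a", "b", "c", "d"],
        "date_seed": pd.to_datetime(["2015-03-01", None, "2016-07-15", None]),
        "date_series_a": pd.to_datetime(["2017-01-10", "2018-05-05", None, None]),
    }
)

subsets = {
    "seed": base[base["date_seed"].notna()].copy(),
    "series_a": base[base["date_series_a"].notna()].copy(),
    # Seed→A requires both rounds to be observed.
    "seed_to_series_a": base[
        base["date_seed"].notna() & base["date_series_a"].notna()
    ].copy(),
}

for key, frame in subsets.items():
    print(key, frame["org_uuid"].tolist())
```

The subsets overlap by construction: a company with both a seed and a series A date appears in all three frames, so row counts across subsets should not be summed.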


In [23]:

# Purpose: Generate round-specific subsets per geography (pre-seed / seed / series A / seed→A / angel) into `dfs`.

# ==========================================
# 1. ORGANIZE INPUTS
# ==========================================
# Put your 4 base dataframes into a dictionary for easy looping
# Ensure these variables exist in your environment before running this
base_datasets = {
    'global': df_global,
    'uk': df_uk,
    'uk_cb_only': df_uk_cb_only,
    'usa': df_usa
}

# This dictionary will hold all 20 resulting dataframes
dfs = {}

# ==========================================
# 2. GENERATION LOOP
# ==========================================
print("Generating subsets...")

for name, df in base_datasets.items():
    print(f"\nProcessing: {name} (Total rows: {len(df)})")
    
    # -------------------------------------------------------
    # 1. PRE-SEED: Companies with a value in date_pre_seed
    # -------------------------------------------------------
    key_pre = f"df_{name}_pre_seed"
    # Filter: date_pre_seed is not NaT (Not a Time)
    dfs[key_pre] = df[df['date_pre_seed'].notna()].copy()
    print(f"  -> Created {key_pre}: {len(dfs[key_pre])} rows")

    # -------------------------------------------------------
    # 2. SEED: Companies with a value in date_seed
    # -------------------------------------------------------
    key_seed = f"df_{name}_seed"
    dfs[key_seed] = df[df['date_seed'].notna()].copy()
    print(f"  -> Created {key_seed}: {len(dfs[key_seed])} rows")

    # -------------------------------------------------------
    # 3. SERIES A: Companies with a value in date_series_a
    # -------------------------------------------------------
    key_sa = f"df_{name}_series_a"
    dfs[key_sa] = df[df['date_series_a'].notna()].copy()
    print(f"  -> Created {key_sa}: {len(dfs[key_sa])} rows")

    # -------------------------------------------------------
    # 4. SEED TO SERIES A: Companies with BOTH Seed AND Series A
    # -------------------------------------------------------
    key_s2a = f"df_{name}_seed_to_series_a"
    # Filter: Seed is NOT NaT  AND  Series A is NOT NaT
    mask_both = df['date_seed'].notna() & df['date_series_a'].notna()
    dfs[key_s2a] = df[mask_both].copy()
    print(f"  -> Created {key_s2a}: {len(dfs[key_s2a])} rows")

    # -------------------------------------------------------
    # 5. ANGEL: Companies with a value in date_angel
    # -------------------------------------------------------
    key_angel = f"df_{name}_angel"
    if 'date_angel' in df.columns:
        dfs[key_angel] = df[df['date_angel'].notna()].copy()
        print(f"  -> Created {key_angel}: {len(dfs[key_angel])} rows")
    else:
        print(f"  -> Skipped {key_angel} (Column 'date_angel' not found)")

# ==========================================
# 3. EXPORT TO VARIABLES (OPTIONAL)
# ==========================================
# This allows you to access them directly as variables like 'df_usa_seed'
# instead of dfs['df_usa_seed']
# NOTE: globals() is used explicitly; locals() only behaves the same at module level.
globals().update(dfs)

print("\nAll DataFrames created successfully.")
print("Example check: df_usa_seed_to_series_a shape:", df_usa_seed_to_series_a.shape)

Generating subsets...

Processing: global (Total rows: 1139718)
  -> Created df_global_pre_seed: 13820 rows
  -> Created df_global_seed: 74722 rows
  -> Created df_global_series_a: 34446 rows
  -> Created df_global_seed_to_series_a: 14782 rows
  -> Created df_global_angel: 17446 rows

Processing: uk (Total rows: 83103)
  -> Created df_uk_pre_seed: 961 rows
  -> Created df_uk_seed: 5451 rows
  -> Created df_uk_series_a: 1682 rows
  -> Created df_uk_seed_to_series_a: 975 rows
  -> Created df_uk_angel: 853 rows

Processing: uk_cb_only (Total rows: 83062)
  -> Created df_uk_cb_only_pre_seed: 961 rows
  -> Created df_uk_cb_only_seed: 5451 rows
  -> Created df_uk_cb_only_series_a: 1682 rows
  -> Created df_uk_cb_only_seed_to_series_a: 975 rows
  -> Created df_uk_cb_only_angel: 853 rows

Processing: usa (Total rows: 417119)
  -> Created df_usa_pre_seed: 5008 rows
  -> Created df_usa_seed: 31955 rows
  -> Created df_usa_series_a: 13632 rows
  -> Created df_usa_seed_to_series_a: 7097 rows
  -> 

## Time-to-Funding (TTF) per Round

Compute time-to-funding in months for each round-date column available in each subset.

Workflow:
1. Convert `founded_on` and each `date_*` column to datetime.
2. Compute `(date_round - founded_on)` in days and convert to months by dividing by 30.4375 (365.25 / 12).
3. Compute `months_seed_to_series_a` where both dates exist.
4. Apply additional cleaning rules (venture-window exclusions; drop negative durations).
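
The workflow above can be sketched for a single round as follows, assuming a toy seed cohort with made-up dates. The frame `cohort` is hypothetical; `AVG_DAYS_IN_MONTH` and the 60-month seed window follow the constants used in the notebook cells.

```python
import numpy as np
import pandas as pd

AVG_DAYS_IN_MONTH = 365.25 / 12  # 30.4375

# Hypothetical seed cohort; dates are made up.
cohort = pd.DataFrame(
    {
        "founded_on": ["2015-01-01", "2015-01-01", "2016-06-01"],
        "date_seed": ["2016-01-01", "2021-06-01", "2016-01-01"],
    }
)
for col in ("founded_on", "date_seed"):
    cohort[col] = pd.to_datetime(cohort[col], errors="coerce")

# Elapsed days divided by the average month length.
elapsed_days = (cohort["date_seed"] - cohort["founded_on"]) / np.timedelta64(1, "D")
cohort["ttf_seed_months"] = elapsed_days / AVG_DAYS_IN_MONTH

# Keep non-negative durations inside the 60-month seed window (bounds inclusive).
MAX_MONTHS_SEED = 5 * 12
in_window = cohort["ttf_seed_months"].between(0, MAX_MONTHS_SEED)
cleaned = cohort.loc[in_window].copy()
print(cleaned["ttf_seed_months"].round(2).tolist())
```

Only the first company survives: the second took longer than 60 months to reach seed, and the third has a seed date before its founding date, which yields a negative duration.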


In [25]:
# Purpose: Compute time-to-funding (TTF) in months for any available `date_*` round columns in each subset.
# NOTE: Computes a Seed→A gap (`months_seed_to_series_a`) when both dates exist.

import pandas as pd
import numpy as np

print("Calculating Time-to-Funding (TTF) in months for ALL available rounds...")

# Average days in a month (365.25 / 12)
AVG_DAYS_IN_MONTH = 30.4375

# Define the standard mappings to check for in every dataframe
# Format: (Date Column Name, Target TTF Column Name)
ROUND_MAPPINGS = [
    ('date_pre_seed', 'ttf_pre_seed_months'),
    ('date_angel',    'ttf_angel_months'),
    ('date_seed',     'ttf_seed_months'),
    ('date_series_a', 'ttf_series_a_months'),
    ('date_series_b', 'ttf_series_b_months'), # Added just in case
]

# Iterate through the dictionary of dataframes
for name, df in dfs.items():
    
    print(f"Processing {name}...")

    # 1. FORCE CONVERT 'founded_on' to datetime
    # We skip if founded_on is missing, as we can't calculate TTF without it
    if 'founded_on' in df.columns:
        df['founded_on'] = pd.to_datetime(df['founded_on'], errors='coerce')
    else:
        print(f"  [!] Skipped: 'founded_on' missing in {name}")
        continue

    # 2. ITERATE through all potential rounds
    # This checks if the column exists in the DF, regardless of the DF's name
    for date_col, ttf_col in ROUND_MAPPINGS:
        if date_col in df.columns:
            # Force convert to datetime
            df[date_col] = pd.to_datetime(df[date_col], errors='coerce')
            
            # Calculate TTF in days, then divide by AVG_DAYS_IN_MONTH (30.4375)
            df[ttf_col] = (df[date_col] - df['founded_on']) / np.timedelta64(1, 'D') / AVG_DAYS_IN_MONTH
            
            print(f"  -> Created '{ttf_col}'")

    # 3. SPECIAL CASE: Gap between Seed and Series A
    # We calculate this only if BOTH dates are present in the dataframe
    if 'date_seed' in df.columns and 'date_series_a' in df.columns:
        # (Dates were already converted in the loop above)
        df['months_seed_to_series_a'] = (df['date_series_a'] - df['date_seed']) / np.timedelta64(1, 'D') / AVG_DAYS_IN_MONTH
        print(f"  -> Created 'months_seed_to_series_a'")

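# Re-export to module-level names. The frames in `dfs` were modified in place,
# so this is only a safeguard for names that were rebound elsewhere.
globals().update(dfs)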

print("\nCalculation complete. (No log transformation applied)")

Calculating Time-to-Funding (TTF) in months for ALL available rounds...
Processing df_global_pre_seed...
  -> Created 'ttf_pre_seed_months'
  -> Created 'ttf_angel_months'
  -> Created 'ttf_seed_months'
  -> Created 'ttf_series_a_months'
  -> Created 'ttf_series_b_months'
  -> Created 'months_seed_to_series_a'
Processing df_global_seed...
  -> Created 'ttf_pre_seed_months'
  -> Created 'ttf_angel_months'
  -> Created 'ttf_seed_months'
  -> Created 'ttf_series_a_months'
  -> Created 'ttf_series_b_months'
  -> Created 'months_seed_to_series_a'
Processing df_global_series_a...
  -> Created 'ttf_pre_seed_months'
  -> Created 'ttf_angel_months'
  -> Created 'ttf_seed_months'
  -> Created 'ttf_series_a_months'
  -> Created 'ttf_series_b_months'
  -> Created 'months_seed_to_series_a'
Processing df_global_seed_to_series_a...
  -> Created 'ttf_pre_seed_months'
  -> Created 'ttf_angel_months'
  -> Created 'ttf_seed_months'
  -> Created 'ttf_series_a_months'
  -> Created 'ttf_series_b_months'
  -

In [26]:
# Purpose: Apply “venture window” exclusions by round type: keep only rows whose round-specific TTF
# lies in [0, max months]. Rows with a longer, negative, or missing (NaN) TTF are all dropped.

print("Applying 'Venture Window' exclusions to homogenize sample...")

# Thresholds in Months
MAX_MONTHS_PRE_SEED = 3 * 12  # 36 months
MAX_MONTHS_SEED     = 5 * 12  # 60 months
MAX_MONTHS_SERIES_A = 7 * 12  # 84 months

for name in list(dfs.keys()): # Snapshot the keys because dfs[name] is reassigned inside the loop
    df = dfs[name]
    initial_len = len(df)
    mask = None
    limit_desc = ""
    col_used = ""

    # 1. Pre-Seed Logic
    if name.endswith("_pre_seed"):
        col_used = 'ttf_pre_seed_months'
        if col_used in df.columns:
            # Filter: non-negative AND within 3 years
            mask = (df[col_used] >= 0) & (df[col_used] <= MAX_MONTHS_PRE_SEED)
            limit_desc = "3 years"

    # 2. Seed Logic
    # Note: For 'seed_to_series_a', the cohort is defined by the Seed round,
    # so we filter out companies that took >5 years to reach Seed.
    elif name.endswith("_seed") or "seed_to_series_a" in name:
        col_used = 'ttf_seed_months'
        if col_used in df.columns:
            # Filter: non-negative AND within 5 years
            mask = (df[col_used] >= 0) & (df[col_used] <= MAX_MONTHS_SEED)
            limit_desc = "5 years"

    # 3. Series A Logic
    elif name.endswith("_series_a"):
        col_used = 'ttf_series_a_months'
        if col_used in df.columns:
            # Filter: non-negative AND within 7 years
            mask = (df[col_used] >= 0) & (df[col_used] <= MAX_MONTHS_SERIES_A)
            limit_desc = "7 years"

    # Apply the filter 
    if mask is not None:
        df_filtered = df[mask].copy()
        dfs[name] = df_filtered
        
        dropped_count = initial_len - len(df_filtered)
        if dropped_count > 0:
            pct = (dropped_count / initial_len) * 100
            print(f"  [{name}] Dropped {dropped_count} rows ({pct:.1f}%) > {limit_desc}. New N={len(df_filtered)}")
        else:
            print(f"  [{name}] No outliers found > {limit_desc}.")
    else:
        # Fallback if the expected column is missing
        print(f"  [{name}] SKIPPED: Could not find time column.")

# Rebind the notebook-level names (df_global_seed, etc.) to the filtered versions in the dict
globals().update(dfs)

print("\nOutlier exclusion complete.")

Applying 'Venture Window' exclusions to homogenize sample...
  [df_global_pre_seed] Dropped 3857 rows (27.9%) > 3 years. New N=9963
  [df_global_seed] Dropped 9245 rows (12.4%) > 5 years. New N=65477
  [df_global_series_a] Dropped 4824 rows (14.0%) > 7 years. New N=29622
  [df_global_seed_to_series_a] Dropped 1165 rows (7.9%) > 5 years. New N=13617
  [df_global_angel] SKIPPED: Could not find time column.
  [df_uk_pre_seed] Dropped 312 rows (32.5%) > 3 years. New N=649
  [df_uk_seed] Dropped 899 rows (16.5%) > 5 years. New N=4552
  [df_uk_series_a] Dropped 333 rows (19.8%) > 7 years. New N=1349
  [df_uk_seed_to_series_a] Dropped 150 rows (15.4%) > 5 years. New N=825
  [df_uk_angel] SKIPPED: Could not find time column.
  [df_uk_cb_only_pre_seed] Dropped 259 rows (27.0%) > 3 years. New N=702
  [df_uk_cb_only_seed] Dropped 663 rows (12.2%) > 5 years. New N=4788
  [df_uk_cb_only_series_a] Dropped 310 rows (18.4%) > 7 years. New N=1372
  [df_uk_cb_only_seed_to_series_a] Dropped 81 rows (8.3%

In [27]:
# Purpose: Remove rows with negative (impossible) time-to-funding durations across relevant TTF columns.

import pandas as pd

print("Checking for impossible timelines (Negative TTF)...")

cleaned_count = 0

for name, df in dfs.items():
    cols_to_clean = []
    
    # Identify relevant columns
    possible_cols = [
        'ttf_pre_seed_months', 
        'ttf_angel_months', 
        'ttf_seed_months', 
        'ttf_series_a_months',
        'months_seed_to_series_a'
    ]
    
    for col in possible_cols:
        if col in df.columns:
            cols_to_clean.append(col)

    initial_rows = len(df)
    
    # Start with all True
    mask_valid = pd.Series(True, index=df.index)
    
    for col in cols_to_clean:
        # 1. Log the actual negatives (for your info)
        negatives = df[col] < 0
        if negatives.sum() > 0:
            print(f"  [{name}] Found {negatives.sum()} rows with negative {col}")
        
        # 2. Build the Safe Mask
        # Keep row IF: (Value is NaN) OR (Value >= 0)
        # This prevents dropping rows just because they don't have that specific round
        col_is_valid = (df[col].isna()) | (df[col] >= 0)
        
        mask_valid = mask_valid & col_is_valid

    # Apply filter
    if not mask_valid.all():
        df_clean = df[mask_valid].copy()
        dfs[name] = df_clean
        cleaned_count += (initial_rows - len(df_clean))

# Rebind the notebook-level names to the cleaned frames
globals().update(dfs)

print(f"\nCleanup complete. Removed {cleaned_count} total rows across all datasets.")

Checking for impossible timelines (Negative TTF)...
  [df_global_pre_seed] Found 20 rows with negative ttf_angel_months
  [df_global_seed] Found 183 rows with negative ttf_pre_seed_months
  [df_global_seed] Found 66 rows with negative ttf_angel_months
  [df_global_series_a] Found 86 rows with negative ttf_pre_seed_months
  [df_global_series_a] Found 106 rows with negative ttf_angel_months
  [df_global_series_a] Found 257 rows with negative ttf_seed_months
  [df_global_seed_to_series_a] Found 65 rows with negative ttf_pre_seed_months
  [df_global_seed_to_series_a] Found 26 rows with negative ttf_angel_months
  [df_global_angel] Found 25 rows with negative ttf_pre_seed_months
  [df_global_angel] Found 288 rows with negative ttf_angel_months
  [df_global_angel] Found 90 rows with negative ttf_seed_months
  [df_global_angel] Found 15 rows with negative ttf_series_a_months
  [df_uk_pre_seed] Found 3 rows with negative ttf_angel_months
  [df_uk_seed] Found 39 rows with negative ttf_pre_seed_

In [28]:
# Purpose: Quick spot-check of the computed TTF values for the UK seed subset.

df_uk_seed['ttf_seed_months'].head()

31     16.000000
67     11.301848
95     29.864476
126     0.459959
167    29.108830
Name: ttf_seed_months, dtype: float64
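
The first value in the spot-check above (16.0 months) corresponds to exactly 487 days, since 487 / 30.4375 = 16. The following is a minimal sketch of the same conversion and of the window mask on a toy frame. The dates and the name `toy` are made up for illustration; only the formula and the 60-month Seed limit come from the cells above. It also shows that a missing TTF fails both comparisons and is therefore excluded by the window filter.

```python
import numpy as np
import pandas as pd

AVG_DAYS_IN_MONTH = 365.25 / 12  # 30.4375, the same constant the notebook uses

# Hypothetical toy data: four companies with made-up dates.
toy = pd.DataFrame({
    "founded_on": pd.to_datetime(["2015-01-01", "2015-01-01", "2015-01-01", None]),
    "date_seed": pd.to_datetime(["2016-05-02", "2014-12-01", "2021-06-01", "2016-01-01"]),
})

# Timedelta -> float days -> float months.
toy["ttf_seed_months"] = (
    (toy["date_seed"] - toy["founded_on"]) / np.timedelta64(1, "D") / AVG_DAYS_IN_MONTH
)

# Seed window: non-negative and at most 60 months.
# NaN compares False on both sides, so rows with a missing TTF are excluded too.
in_window = (toy["ttf_seed_months"] >= 0) & (toy["ttf_seed_months"] <= 60)

print(toy.assign(in_window=in_window))
```

Only the first row survives: the second has a negative duration, the third exceeds five years, and the fourth has no founding date.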

## Investor Archetypes

Classify the *dominant* investor type for the round based on Crunchbase investor metadata.

Steps:
- Load `investors.csv` (UUID → investor_types).
- Parse the UUID strings stored on each funding round.
- Apply a strict hierarchy to select a single archetype (VC, corporate VC, accelerator, …).
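
The steps above can be sketched on toy data as follows. This is an illustrative reduction, not the notebook's implementation: the IDs, the three-level hierarchy, and the helper names (`split_ids`, `dominant_archetype`) are assumptions, and it uses exact token matching only.

```python
import re

# Hypothetical, trimmed-down hierarchy; the notebook's mapping has more archetypes.
ARCHETYPE_TOKENS = {
    "VC": {"venture_capital", "micro_vc"},
    "ACCELERATOR": {"accelerator", "incubator"},
    "ANGEL": {"angel", "angel_group"},
}
PRIORITY = ["VC", "ACCELERATOR", "ANGEL"]  # first match wins

# Made-up investor IDs standing in for Crunchbase UUIDs.
INVESTOR_TYPES = {
    "id-1": ["angel"],
    "id-2": ["accelerator", "venture_capital"],
    "id-3": ["angel_group"],
}


def split_ids(raw):
    """Split an ID string on commas, pipes, or whitespace."""
    if raw is None:
        return []
    return [tok for tok in re.split(r"[,\s|]+", str(raw)) if tok]


def dominant_archetype(raw):
    """Return the highest-priority archetype present among a round's investors."""
    tokens = set()
    for investor_id in split_ids(raw):
        tokens.update(INVESTOR_TYPES.get(investor_id, []))
    for archetype in PRIORITY:
        if tokens & ARCHETYPE_TOKENS[archetype]:
            return archetype
    return "OTHER"  # no investors, unknown investors, or unmapped types


print(dominant_archetype("id-1, id-2"))  # an angel plus a VC-typed investor
print(dominant_archetype("id-1|id-3"))   # angels only
print(dominant_archetype(None))          # no investor list
```

Because the first match in the priority list wins, a single VC-typed participant labels the whole round as VC regardless of how many angels or accelerators took part.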


In [29]:
# Purpose: Classify a dominant investor archetype per round using Crunchbase investor types and a hierarchy.

import pandas as pd
import numpy as np
from collections import Counter

# ==========================================
# 1. LOAD INVESTORS & BUILD LOOKUP
# ==========================================
investors_path = '/Users/stefan/Desktop/Thesis/v4/Crunchbase Data/bulk_export/investors.csv'

print(f"Loading investors from: {investors_path}")
df_investors = pd.read_csv(investors_path, usecols=['uuid', 'investor_types'])
df_investors = df_investors.dropna(subset=['investor_types'])

print("Building UUID -> Type lookup map...")
uuid_type_map = {}

# Convert to dict
raw_map = df_investors.set_index('uuid')['investor_types'].to_dict()

# Normalize the lookup table
for uuid, types_str in raw_map.items():
    if isinstance(types_str, str):
        uuid_type_map[uuid] = [t.strip().lower() for t in types_str.split(',')]

print(f"Mapped {len(uuid_type_map)} investors with valid types.")

# ==========================================
# 2. DEFINITIONS (HIERARCHY)
# ==========================================

ARCTYPES = {
    "VC": {"venture_capital", "micro_vc", "venture_debt"},
    "CORPORATE_VC": {"corporate_venture_capital"},
    "FAMILY_OFFICE": {"family_investment_office"},
    "ACCELERATOR": {
        "accelerator",
        "incubator",
        "co_working_space",
        "entrepreneurship_program",
        "startup_competition",
    },
    "PRIVATE_EQUITY": {"private_equity_firm"},
    "INSTITUTIONAL": {
        "hedge_fund",
        "fund_of_funds",
        "investment_bank",
        "pension_funds",
        "secondary_purchaser",
        "government_office",
        "university_program",
    },
    "ANGEL": {"angel", "investment_partner", "angel_group", "syndicate"},
}

# The Strict Priority Order
# The function will check for these in order. The first one found "wins".
PRIORITY_ORDER = [
    "VC",
    "CORPORATE_VC",
    "FAMILY_OFFICE",
    "ACCELERATOR",
    "PRIVATE_EQUITY",
    "INSTITUTIONAL",
    "ANGEL",
    "OTHER"
]

archetype_dtype = pd.CategoricalDtype(PRIORITY_ORDER, ordered=True)

# ==========================================
# 3. CLASSIFICATION LOGIC (Hierarchy Rule)
# ==========================================

def parse_uuid_string(val):
    if pd.isna(val) or str(val).strip() == "":
        return []
    clean_str = str(val).replace(",", " ").replace("|", " ")
    return [x.strip() for x in clean_str.split() if x.strip()]

def get_dominant_archetype(uuid_string: str) -> str:
    # 1. Parse UUIDs
    uuids = parse_uuid_string(uuid_string)
    
    if not uuids:
        return "OTHER"
    
    # 2. Collect ALL investor types present in this round
    round_tokens = set()
    for u in uuids:
        types = uuid_type_map.get(u, [])
        round_tokens.update(types)
        
    if not round_tokens:
        return "OTHER" 

    # 3. Check Hierarchy (Highest Priority Wins)
    # We iterate through the PRIORITY_ORDER. If any token in the round
    # matches the definition of that archetype, we return it immediately.
    
    for archetype in PRIORITY_ORDER:
        if archetype == "OTHER":
            continue
            
        lexemes = ARCTYPES[archetype]
        
        # Check if ANY token in the round matches ANY lexeme for this archetype
        # We look for exact match OR prefix match (e.g. 'angel_group')
        for token in round_tokens:
            if token in lexemes or any(token.startswith(f"{lex}_") for lex in lexemes):
                return archetype
            
    # 4. Fallback
    # If we found tokens but none matched our specific definitions
    return "OTHER"

# ==========================================
# 4. APPLY TO DATAFRAMES
# ==========================================
print("\nCalculating 'investor_type' (Dominant Logic) for all datasets...")

for name, df in dfs.items():
    
    target_col = None
    if name.endswith("_pre_seed"):
        target_col = 'uuids_pre_seed'
    elif name.endswith("_angel"):
        target_col = 'uuids_angel'
    elif name.endswith("_seed") or "seed_to_series_a" in name:
        target_col = 'uuids_seed'
    elif name.endswith("_series_a"):
        target_col = 'uuids_series_a'
        
    if target_col and target_col in df.columns:
        # Classify each round, then cast to the ordered categorical so that
        # sort_index() below lists archetypes in priority order.
        df['investor_type'] = df[target_col].apply(get_dominant_archetype).astype(archetype_dtype)
        
        print(f"\n--- {name} (based on {target_col}) ---")
        stats = (
            df['investor_type']
            .value_counts(normalize=True)
            .mul(100)
            .sort_index()
            .round(2)
        )
        print(stats.to_string())

# Rebind the notebook-level names to the updated frames
globals().update(dfs)
print("\nProcessing complete.")

Loading investors from: /Users/stefan/Desktop/Thesis/v4/Crunchbase Data/bulk_export/investors.csv
Building UUID -> Type lookup map...
Mapped 177550 investors with valid types.

Calculating 'investor_type' (Dominant Logic) for all datasets...

--- df_global_pre_seed (based on uuids_pre_seed) ---
investor_type
VC                38.73
CORPORATE_VC       1.08
FAMILY_OFFICE      0.43
ACCELERATOR       22.67
PRIVATE_EQUITY     0.26
INSTITUTIONAL      0.93
ANGEL              6.67
OTHER             29.24

--- df_global_seed (based on uuids_seed) ---
investor_type
VC                47.30
CORPORATE_VC       1.09
FAMILY_OFFICE      0.38
ACCELERATOR       11.02
PRIVATE_EQUITY     0.93
INSTITUTIONAL      0.75
ANGEL              7.09
OTHER             31.44

--- df_global_series_a (based on uuids_series_a) ---
investor_type
VC                74.04
CORPORATE_VC       0.96
FAMILY_OFFICE      0.28
ACCELERATOR        0.56
PRIVATE_EQUITY     3.79
INSTITUTIONAL      0.66
ANGEL              2.45
OTHER     

## Top Investors

Flag whether a round includes at least one investor from a curated set of major funds, individuals, and accelerators. The check is on participation in the round, not on lead status.

Produces:
- `top_investor`: boolean indicator.
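
A minimal sketch of how such a flag can be derived from a delimited ID string. The IDs, the column contents, and the name `has_top_investor` are made up for illustration; the curated set used in this notebook is defined in the cell below.

```python
import pandas as pd

TOP_IDS = {"id-a", "id-b"}  # hypothetical stand-in for the curated UUID set


def has_top_investor(raw) -> bool:
    """True if any ID in a comma/pipe/space-delimited string is in TOP_IDS."""
    if pd.isna(raw):
        return False
    tokens = str(raw).replace(",", " ").replace("|", " ").split()
    return not TOP_IDS.isdisjoint(tokens)


rounds = pd.DataFrame({"uuids_seed": ["id-x,id-a", "id-y | id-z", None, "  "]})
rounds["top_investor"] = rounds["uuids_seed"].apply(has_top_investor)
print(rounds)
```

Note that a round with no recorded investors and a round with only non-listed investors both map to `False`, so the share of rows with missing investor lists is worth reporting alongside the flag.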


In [30]:
# Purpose: Flag whether the round includes any investor from a curated “top” set (funds, individuals, accelerators).

import pandas as pd

# ==========================================
# 1. DEFINE TOP INVESTOR SET
# ==========================================
TOP_INVESTOR_IDS = {
    # VCs
    "d3184029-74d8-7e6c-717b-84ac89d0a837",  # Tiger Global Management
    "0c867fde-2b9a-df10-fdb9-66b74f355f91",  # Sequoia Capital
    "ce91bad7-b6d8-e56e-0f45-4763c6c5ca29",  # Andreessen Horowitz
    "47b84763-9727-7cdf-b194-2742e3963147",  # New Enterprise Associates
    "d5df3873-7871-c608-0284-c74d0b555ccd",  # Lightspeed Venture Partners
    "6d4a585b-6802-dc2c-c1e5-8b7cac2c5f89",  # Dragoneer Investment Group
    "b915e540-3377-6a2a-651e-6fd7c0787e26",  # TCV
    "beadb218-e5fa-2686-bc95-4dfaa9acc2e8",  # Bessemer Venture Partners
    "b08efc27-da40-505a-6f9d-c9e14247bf36",  # Accel
    "01b78c00-b6e2-7cbc-8286-8d14d09d441e",  # OrbiMed
    "7a42b564-4bb6-5864-6cdb-a0100008f3b3",  # Battery Ventures
    "19018b61-31aa-4eb1-a319-1e3147334e45",  # Deerfield Management
    "fe5a4983-a46a-2fc2-5633-e35e0a86b694",  # Khosla Ventures
    "985d1d30-9137-0dd6-bcf8-f8f88fe82b3b",  # Ascension Ventures
    "f2ba44f2-9258-40af-922b-213890f828de",  # Kinetic Investments
    "0686cdc6-d6e7-0417-8329-29b5c85afeef",  # Seedcamp
    "263539cd-fdd5-dd6e-0942-4188e5380e61",  # MMC Ventures
    "2da43d4c-2e57-4782-b35f-c1f8faf03d39",  # AlbionVC
    "be0b97a1-9293-b526-df88-a80d2b255145",  # Amadeus Capital Partners
    "fd0c6051-176b-b3a6-2124-b071945d3b02",  # Notion Capital
    "d27114f3-3e81-24f8-c5c9-ad29fb114e9c",  # Playfair Capital
    "7eb2cae7-3762-803e-4dc2-61b3f424d317",  # Molten Ventures
    "c901d65e-c346-e1a8-7a87-58dd6ca991f9",  # Octopus Ventures
    "60b5dade-24ed-84b0-99e2-1ecda554a1ab",  # Index Ventures
    "6f55b001-be09-e530-ddc2-a3a5b42f37ab",  # Hoxton Ventures
    "49cf2639-71dc-1485-0593-fffd18d51d15",  # Episode 1
    "7fe5c779-d231-b528-914a-69b8cfae8538",  # BGF
    "da761bfc-adb7-e9b1-b90d-52d436c96e75",  # Entrepreneurs First
    "f320dda2-956b-f048-9759-1f700e7101c5",  # Passion Capital
    "ff5d7781-999a-b2f7-097c-5d39998fd021",  # SFC Capital
    "56e40f50-97c7-2a77-255d-1d97d5f30646",  # 500 Global
    "d0cdfdc0-517d-ce18-4a74-194f506bccad",  # Balderton Capital
    "308fb2fa-4610-0909-1d34-17216d007b21",  # Boldstart Ventures
    "9da42114-b627-6926-f3b7-fd07e23fc337",  # BoxGroup
    "b281bc29-17cd-ab2e-488d-ddb22a514c6f",  # Cherry Ventures
    "4ede174d-3254-8602-e977-d9c0bfe34433",  # Creandum
    "dcfb1ae3-35b1-0bf2-0b3d-595031ab4507",  # DCVC
    "08ebf8ae-4acb-955c-02c5-305c0f470477",  # Entrée Capital
    "c250ecc2-4f96-b21b-4485-23245263a764",  # Eurazeo
    "20242b76-1950-6156-2811-f7171fc99de8",  # FJ Labs
    "735a53c6-a7ba-2434-d97d-b788536da89d",  # Felicis Ventures
    "043d9e52-dcc0-0dd8-6074-206e42e20e13",  # First Round Capital
    "423c1d4b-139e-444f-9022-01e30af5c9cc",  # Fondation FIT
    "6b43c7ec-5898-0866-af46-494775acb00f",  # Founder Collective
    "fb2f8884-ec07-895a-48d7-d9a9d4d7175c",  # Founders Fund
    "473c465f-28d4-2568-c632-b2d5dd77c541",  # FundersClub
    "44c42bcb-3317-4b0c-8fc9-42dfa390c4e9",  # GFC
    "3c62db15-4db8-016a-145e-dd1a06d168e2",  # GV
    "34a9bb66-0984-8de0-e7fb-13e976b4a135",  # General Catalyst
    "8c2904bd-f9a1-0f41-87db-16fae614ad2d",  # Great Oaks Venture Capital
    "e2006571-6b7a-e477-002a-f7014f48a7e3",  # Greylock
    "d370dc6d-d969-876f-bf24-dccf34b84a91",  # HV Capital
    "e76a371e-6afb-49f9-9cb5-ef7aa8390687",  # Hongshan Fund
    "343a0c91-d3b9-80fa-2330-cfde495decf1",  # IDG Capital
    "e7442a98-a043-dd5c-5def-ebcc56fb78ce",  # Initialized Capital
    "6efb9171-a87b-212d-f399-b0613da7b1c9",  # Insight Partners
    "d1bc5ab8-adde-56aa-5dfc-d69d0377430d",  # Kima Ventures
    "f6a3df1b-7f7c-c517-b1d0-5441069dd097",  # LocalGlobe
    "2c8502f9-cf1a-7b91-9f10-7976a1bf753a",  # Mangrove Capital Partners
    "0fd7d238-28b7-4540-a5a9-dbd65cd033f4",  # Matrix Partners
    "dbdafc98-7c6b-9cf1-b1bf-d3f590197dd1",  # Northzone
    "7475db43-36a9-414d-f7a9-b92b78bae47b",  # Partech
    "93c87ab7-7cee-5c54-9941-104693dbbdba",  # Plug and Play
    "e8de4e94-b48c-76f6-3770-aac7e51158f4",  # Point Nine Capital
    "d29973c3-95ec-691d-defd-6a6ba713c72a",  # Redpoint Ventures
    "902deab4-ec41-68ce-d9df-c0c959578176",  # SV Angel
    "1a410398-3a72-5882-99b8-6318cf594850",  # SoftBank
    "06059314-d10f-c088-30c8-5a71e40ef75d",  # Speedinvest
    "34ec3a05-c405-9929-5ff6-b18ab93d620e",  # ZhenFund

    # Angels
    "bb902061-1366-30d9-5ba2-566f5edf065c",  # Hesham Zreik
    "16496cb3-1c18-98f1-0eff-3819930827a8",  # Edward Lando
    "67d7b771-7720-8d6a-127a-2bf3ac0dbac6",  # Bashar Hamood
    "f0dd4891-7d2a-8bab-9a0b-a9cd46c54910",  # Kunal Shah
    "bb0955c3-af0d-2d01-cd4f-985a93252432",  # Naval Ravikant
    "2907e34e-15b2-ec8f-9815-f9f0b70e7235",  # Fabrice Grinda
    "ebae6782-7040-b515-9c89-36fb0a31e113",  # Mark Cuban
    "af778eb2-d12c-d953-7d7c-2e3bd5581cfe",  # Scott Belsky
    "d6598180-584e-4d64-9dff-6189bb4a0b30",  # Elad Gil
    "c8e0c690-e891-4115-bb57-5652199c8aeb",  # Charlie Songhurst
    "2ae517bd-f03e-b7fc-8e79-8448627d15c4",  # Gokul Rajaram
    "1958e5fb-f808-c866-81b4-c6e27f77edd4",  # Balaji Srinivasan
    "6f618ed3-037d-427b-9d85-deca827d6dfb",  # Nadav Ben-Chanoch
    "61d19fc5-3a69-e4f7-9a6d-83d4fb6a0212",  # Marc Benioff
    "1417de64-9e75-ff31-eb74-b52263a98414",  # Daniel Curran
    "564bdcf3-8bf7-4040-dac2-072af3e72e6f",  # Paul Buchheit
    "11340da3-b2bb-5d51-19bc-c6006d151164",  # Chris Adelsbach
    "6c98aa6a-0383-ea74-c27c-996b38daa5ed",  # Alexis Ohanian
    "69ba8866-c721-28e3-8b53-d10d81cece26",  # Scott Banister
    "ad02a619-4e2b-4b25-a831-973142649079",  # Jon Oringer
    "f2ff09b3-343c-67b1-2469-6ab0cfed43b9",  # Shervin Pishevar
    "038f8cd4-7b1e-be6e-f45d-d952c14e6802",  # Wei Guo
    "1601ab0b-87bb-5ee5-9545-c289d4f706cf",  # Justin Mateen
    "48799440-ee04-35a0-1f7c-9ece8b4aa6ef",  # Lachy Groom
    "3de3ceed-1a9f-b710-5c5e-494e218155a5",  # Kevin Mahaffey
    "3f47be49-2e32-8118-01a0-31685a4d0fd7",  # Peter Thiel
    "6e037bba-68cc-c247-0148-c565408fd131",  # Esther Dyson
    "f63055a5-a077-f7ad-2ace-9dbbac22a366",  # Sam Altman
    "25ca837a-83ed-5169-7124-b10de7d101a4",  # Louis Beryl
    "eefb8c84-89cb-d1bc-4286-06c4948f4607",  # Tom Williams
    "1975faea-863c-0f5f-31ce-eb3b9ff160cc",  # Sahin Boydas
    "8c4276d0-b08b-9dd0-0f55-b6fd72b80fb5",  # Xavier Niel
    "5ab0b475-7d0b-0dba-f640-4e6ddae15101",  # Kevin Lin
    "218269b5-0630-56a5-da47-43b4dc30d944",  # Tim Draper
    "43b343a7-ff56-090a-63fa-5f833e55ea6d",  # Ronald Conway
    "67ae31b5-9f5e-6ea2-1fbb-ba34b6c7f5de",  # Cyan Banister
    "cd730db2-a37f-0d5b-9b34-185edafe5017",  # Max Levchin
    "379838f6-5109-44b6-a401-8e90675de962",  # Sandeep Nailwal
    "b41707f2-994c-da6c-11e5-a2086c1409d9",  # Anupam Mittal
    "77d5871f-f84c-8bd1-38f5-ce36ea09ed55",  # Kevin Hartz
    "72734ad0-f180-155e-55ae-4832d2ad4b1b",  # Clark Landry
    "c71fb8a9-51aa-02d8-84c0-e80049d7be46",  # Simon Murdoch
    "54e6b91c-bca3-4b98-61f0-d33bcd359a3e",  # Dylan Field
    "e81f7292-ab90-961d-3adc-35a46e237914",  # Bradley Horowitz
    "b4837bf7-544c-3d22-53e3-02b94ecbbe3c",  # Taavet Hinrikus
    "6bbee73e-72ca-be5f-9cd2-ddc54a8b237f",  # Reid Hoffman
    "0d54b761-d5aa-be50-01fb-4cd07fd0ca93",  # Ramakant Sharma
    "a3fec26e-0496-dbfa-38d0-188fe7f9d718",  # Kevin Moore
    "601dc2f0-6378-e37b-f1ab-6032b6ab3715",  # Nat Friedman
    "1f54d726-502a-cf68-bb0c-8292123f8037",  # Wayne Chang
    "8d9b5e65-ddb3-4ac0-9be5-ca4e254948f1",  # Thibaud Elziere
    "21362c6c-6df1-95a9-a2ff-a1bd1b8edaf8",  # Nitesh Banta
    "82f78c17-46c6-7ca4-9505-ee69a6aa0033",  # George Burke
    "2cbe1ecc-7e49-6971-9e69-ddaa7a5ae4a7",  # David Tisch
    "c938b6b5-7b4b-4952-b38f-8ee7484d8d1f",  # Private Investors
    "62ab53b0-3a28-0cb7-9995-fd3215b78e41",  # Bill Gates
    "0159561f-be5d-81eb-7ccd-b4d6efeb30fc",  # Mark Pincus
    "356bd44c-09ca-54fb-9a0f-f30167968a6f",  # Guillermo Rauch
    "abbda1d1-ac28-e1cc-1e91-6070a60a2c81",  # Auren Hoffman
    "35f26b49-c20c-096a-9723-f41f745d1a43",  # Binny Bansal
    "e042a336-304b-0d89-af42-0020d7f4a307",  # Arjun Sethi
    "84e270f3-3263-602a-742e-57df022c2b93",  # Kunal Bahl
    "34bd99e3-d8e8-2cc0-7f2a-945e4429fb90",  # Rajan Anandan
    "d37c75fb-83ab-9fc0-0e13-966efea9736d",  # Justin Kan
    "ce540c0b-8efd-b812-9cd9-e08bb7a212df",  # Garry Tan
    "9ab96318-bdda-0378-e188-707f0960cbb3",  # James Sowers
    "907913de-9233-9bd7-a4bf-cfa4f45cc52b",  # Shane Neman
    "2bcb5152-8a78-3cc1-fd81-e76bc052ecf6",  # Dharmesh Shah
    "36463a2c-6cfe-d3ad-cdb2-60e4b8e80d6b",  # Gary Vaynerchuk
    "a3dd4b2c-873e-71d2-367e-32e846f79f15",  # Eduardo Ronzano
    "d7c61edb-c3f6-0dda-9852-ac2e1524ed99",  # Chris Sang
    "888b4628-b2b2-2d86-699e-0fcad21dec05",  # Benjamin Ling
    "8365a39b-5cc0-7d9c-6b10-9a03284d9fc0",  # Jeremy Yap
    "d82f5588-d498-5fdb-5551-a7456997c198",  # Jack Altman
    "eeb4b3e4-fa14-15e4-462a-d74dcb122f74",  # Lee Linden
    "fae1d08c-47cd-7da0-b8ff-4b24deab76b6",  # Immad Akhund
    "6de343b8-86f5-d194-7368-08877be13323",  # Rohit Bansal
    "5169785f-2492-4dfe-4fec-01b8594d7f5a",  # Joanne Wilson
    "5ebfc8db-c498-b5d8-4066-c4584bd96d18",  # Arash Ferdowsi
    "2b8816c2-c07d-9294-2c20-a324c92e94b1",  # Daren Cotter
    "ed970888-34df-cf7c-46b7-441d54240933",  # Ashton Kutcher
    "3a476a62-d930-8aab-041d-cfe5e45a679a",  # Spencer Rascoff
    "9d8fa74d-3532-2074-adf3-beb2ff3de31a",  # Rob Dobson
    "f88b42c1-8a40-2652-ad2a-cdc79463005a",  # Joshua Schachter
    "2433080a-d984-4c6a-91d3-9375488c7fc0",  # Bryan Rosenblatt
    "2b2b6a73-5717-4e23-8b21-51a684a97281",  # Roman Smolevskiy
    "4e01b3df-fab2-33ee-98ec-d3fc3fa2f2ed",  # Eric Schmidt
    "143aa0dc-3e17-e4a2-65c4-65cdace51fba",  # Brendan Wallace
    "a84e3ad4-46f1-4703-3d07-a558c4f6d9e1",  # Daniel Gross
    "759ccd3b-a622-8d14-5656-86662013a1e6",  # Brad Flora
    "6f8557a8-cce3-f2bd-259b-80d583f3c356",  # Vishal Rao
    "2796310b-3159-922a-9306-4bc91f7e1dab",  # Nat Turner
    "d1de4eda-39c6-4182-6811-c1b9c9b9048a",  # Sebastien Borget
    "43c89532-81ca-69d1-639a-e114965ff042",  # Vijay Shekhar Sharma
    "74a28ec8-c7e4-0810-2785-25e25cea88be",  # Keith Rabois
    "cede4417-bffb-e76e-58b6-1de15e32e1ff",  # Farzad Nazem
    "7247b525-1c8c-a396-a5d2-89863f4d3c79",  # Eric Ries
    "e98345ec-cc19-bd4c-f502-affff6b3ec87",  # Avichal Garg
    "d91bdc88-5486-e417-1ff3-34a43e0e10b7",  # Dave Morin
    "3ed25b37-f4a7-c19f-3874-53d1969399cf",  # Paul Forster

    # Accelerators
    "73633ee4-ea65-2967-6c5d-9b5fec7d2d5e",  # Y Combinator
    "3718597a-dd39-6661-3630-09cdd43bcac2",  # Techstars
    "183692b0-b175-b125-89ce-2badd9f56b55",  # MassChallenge
    "56e40f50-97c7-2a77-255d-1d97d5f30646",  # 500 Global
    "39041e62-6b24-ae8d-1347-4cea947e832c",  # SOSV
    "93c87ab7-7cee-5c54-9941-104693dbbdba",  # Plug and Play
    "5167b830-a941-ed08-d275-f74473d13e91",  # Google for Startups
    "9bdae0c0-7eb4-c3d3-08eb-7014b093d938",  # Newchip Accelerator
    "ae6eb8da-ebb3-42be-9ee6-644a57af8755",  # European Innovation Council
    "d7726d53-a989-310b-5033-52d48e7822b6",  # FasterCapital
    "26e824b5-c141-ebee-840c-59ab815903ba",  # VentureOut
    "e26b98e6-3997-577d-eb5b-a195e31d03e3",  # Start-Up Chile
    "d99c68ca-9d26-914e-5838-9ead25f503f3",  # Cleantech Open
    "9d11d829-645f-4ebe-bbd1-e84d1a0ddaf0",  # Pioneer Fund
    "0812e67e-d94d-27e0-8112-8f63c7337271",  # HAX
    "0686cdc6-d6e7-0417-8329-29b5c85afeef",  # Seedcamp
    "136b84b0-27ae-1fd5-6a44-379bc5ad9e18",  # Startup Wise Guys
    "310d858d-653e-90d4-2527-dc9a555326e7",  # Startupbootcamp
    "2d296802-0ac2-3ee5-5d07-4e9383655a72",  # LAUNCH
    "e96105f1-25d2-af8d-7ecb-3da5e2d8aef7",  # Alchemist Accelerator
    "983675c3-2ee5-471f-6b57-c9e32a0369c1",  # IndieBio
    "902a18e7-1af3-4fc2-8267-499691fb5df7",  # gbeta
    "4cecfbfe-af23-caac-f464-452a60a810c2",  # Forum Ventures
    "7645960f-f48e-4759-897d-39f79381ce6a",  # Comcast RISE
    "4330106a-a1f5-2edb-c71e-ff03738d0b54",  # SkyDeck Berkeley
    "70503993-5927-4256-a30a-c0ac1a7b6b99",  # Orbit Startups
    "da761bfc-adb7-e9b1-b90d-52d436c96e75",  # Entrepreneurs First
    "f8c55c25-d1a1-9788-2159-3814c7f9276d",  # IIDF
    "e2c6cc8e-9880-ce20-9ba4-95b593afb930",  # AGORANOV
    "5b28f67a-8fd1-4f61-6fde-c5d92af09440",  # Ontario Centres of Excellence
    "0003f244-79d0-6178-353e-33dabaf3b2c6",  # NFX
    "b4386e5f-53a6-fc77-ebd4-b6e0bef0aaef",  # ERA
    "809d0f70-9c3a-48ba-9c04-fc6e75edc6e5",  # EIC Accelerator
    "5e459098-bee0-40fb-9bc2-91d704615b10",  # AcceliCITY
    "047bfe44-727e-360a-caa0-c3ca8d18c48e",  # MedTech Innovator
    "01e00179-026f-0725-2530-fb93e449587d",  # Village Capital
    "9386e62e-e1e1-be03-492a-0e3660e66120",  # Outlier Ventures
    "901c3ad2-30a6-a5ce-61be-d45810cb515a",  # IQT
    "fdabe06e-0f58-3254-2d4d-040cb02a1070",  # StartX
    "b4a54676-bf60-76ed-dd1e-853c0f6ecea1",  # Tenity
    "35e5ca71-ee44-1043-abaa-e9578d72d28a",  # Brinc
    "fbc59c0f-ac0b-45ec-ad88-5379fef2d28d",  # Paul G. Allen Family Foundation
    "077480d2-c6b9-4cca-97f1-f5444c0d95c7",  # ArtsFund
    "a9e1d132-2311-059e-0976-5f26c52acc1e",  # NEXT Canada
    "8a1dee62-de42-f204-13fa-1d68d3413ca8",  # EXPERT DOJO
    "e89c7650-1b44-86fd-c946-ed10858d90a9",  # Sting
    "459b0b0d-eae2-e063-6f65-69e85c913348",  # Mucker Capital
    "4ad04e56-97a8-b81a-f3a5-cb9a84919e9a",  # SVG Ventures
    "439d7bbf-11ae-7ddf-1887-727188ac065d",  # gener8tor
    "ceebed96-9b79-0c41-3ec6-a7d0f4b7adf7",  # CyberAgent Capital
    "c8e28bad-4ceb-a320-716b-f5d8b6f95565",  # DMZ
    "a72ef1d8-0d27-3bb4-8038-6122559c5d4f",  # Springcamp
    "e251cd43-97f7-4d9b-9138-f79f234848cb",  # Rockstart
    "df5b91ed-e271-4bb7-b091-fb0157b0f33e",  # CNTTECH
    "ec52a88c-facd-9abf-4b4d-0bd743779dff",  # EIT InnoEnergy
    "a9bb5837-4190-2f46-6740-c673f233b911",  # BonAngels Venture Partners
    "cfe7ab5a-315e-46bc-8add-e78d47256c1a",  # 100Unicorns
    "b6e5fec6-35f4-df51-ffdc-9923c850572b",  # Betaworks
    "c2a2fb04-afc0-a82a-10c9-918c365e0a65",  # Eurasante
    "5b0c7705-9564-c627-c29e-6ea8a319ea0c",  # SAP.iO
    "fff6ad5c-8261-529f-3ced-7693d43ce71d",  # EIT Digital Accelerator
    "058eba79-fc8b-d913-c7db-018de6663581",  # Ben Franklin Technology Partners
    "21a2f419-e5ad-8482-0622-55fd0ecd2423",  # SparkLabs Accelerator
    "83c4ea22-ea91-8d35-c259-7b5bbd4cdbca",  # FuturePlay
    "2273dd36-1bdb-f4ab-fb04-c84e7413a1a1",  # EvoNexus
    "87b16fdf-a1c8-4ec0-ae9c-323b64cfa70d",  # Orange DAO
    "7191366c-50ab-37f8-89c2-0068f7bf637e",  # Capital Factory
    "0168cb95-fa6f-4b97-804c-e6bcde50ff2d",  # Lotte Ventures
    "c0e55514-6e53-affd-2a23-a7de9e1d54ed",  # Elemental Impact
    "9e50fe29-4fb3-4533-807b-34ec16425b74",  # MassChallenge Switzerland
    "b5175646-b8e0-cce8-0917-c33ec2180c27",  # Hackquarters
    "36409f11-4a22-2564-2708-5834206cf98c",  # South Park Commons
    "73a3a1f2-5cc3-45aa-9b71-e819dd978b74",  # Surge
    "8cbd9845-c791-b45e-047e-c6385a72dc05",  # WILCO
    "60a8d903-841f-c27b-a5dd-e44ab18c0808",  # Accelerator Centre
    "a109eff2-e8ab-4cec-8692-6ed2260a317e",  # Third Derivative
    "e00b0415-24b1-fe8d-1371-0a8e5544d5d9",  # Chinaccelerator
    "2e3878ed-9391-f809-08e7-c1cdfa64a188",  # NDRC
    "cd261b20-809a-62f8-062e-6cecc7065766",  # Haatch
    "2925a807-5ec2-75ee-7a46-d70a345cab09",  # Bethnal Green Ventures
    "c6982e6a-677a-41cd-8f82-8d096ddb2bf4",  # Innovation Capital
    "5a574c3f-ee4f-4802-968e-ad0bbcda270d",  # Target Takeoff
    "07599147-bc19-fe25-9b7f-0232fbea265e",  # DigitalHealth.London Accelerator
    "87438937-2c91-50fb-2489-cee3c1910ad0",  # LaunchVic
    "bc062134-08a6-c2cb-8f51-3fd72d43a8ae",  # FCJ Venture Builder
    "ea427ec8-be0d-4173-b476-9210d26aaf49",  # NC IDEA
    "ad4e3101-a139-9a76-c34d-4d43bc565c33",  # AngelPad
    "1c42fb6d-bc47-4211-bc33-202094380ae8",  # Company Ventures
    "be112a37-4b6b-8ede-f90b-3bfd7fcce335",  # WOW Aceleradora
    "33d6a423-7e5d-4a2a-4d13-fb69dd87734b",  # JioGenNext
    "bb48b54a-516d-6b34-6cfe-73beeb288984",  # MetaProp
    "31cac94c-4db9-6148-3760-d7870eca0f88",  # Bluepoint
    "bf0b2ace-1dfa-f54c-3554-440f50efc756",  # SixThirty
    "d7e775bf-7766-45c5-952a-8f57e1b4e6ab",  # Normandie Incubation
    "1c0f5e1d-71ce-cb0d-4566-51ef647e6075",  # Founder Friendly Labs
    "14cc2cfc-8b47-8c9b-c02a-48c5bccd534c",  # Boomtown Accelerators
    "2b093ae7-7907-4ab3-9b81-415c4c02ab7f",  # India Accelerator
    "4dd8a064-2149-55d5-57d5-2e0bd790ba70",  # Future Fifty
    "9f6ded06-3570-4723-978d-af3e22194a9f",  # MAGIC Fund
}

# ==========================================
# 2. LOGIC (Robust Parsing)
# ==========================================

def check_top_investor(uuid_string: str) -> bool:
    """
    Checks if ANY of the UUIDs in the string are present in the TOP_INVESTOR_IDS set.
    Handles space, comma, or pipe delimiters.
    """
    if pd.isna(uuid_string) or str(uuid_string).strip() == "":
        return False
    
    # 1. Normalize delimiters (Replace comma/pipe with space)
    clean_str = str(uuid_string).replace(",", " ").replace("|", " ")
    
    # 2. Split on whitespace
    tokens = [t.strip() for t in clean_str.split() if t.strip()]
    
    # 3. Check intersection
    return any(token in TOP_INVESTOR_IDS for token in tokens)

# ==========================================
# 3. APPLY TO DATAFRAMES
# ==========================================
print("Calculating 'top_investor' flag...")

for name, df in dfs.items():
    
    # Identify relevant UUID column
    target_uuid_col = None
    if name.endswith("_pre_seed"):
        target_uuid_col = 'uuids_pre_seed'
    elif name.endswith("_angel"):
        target_uuid_col = 'uuids_angel'
    elif name.endswith("_seed") or "seed_to_series_a" in name:
        target_uuid_col = 'uuids_seed'
    elif name.endswith("_series_a"):
        target_uuid_col = 'uuids_series_a'
        
    if target_uuid_col and target_uuid_col in df.columns:
        
        # --- A. Count Missing UUIDs ---
        # A row is "missing" if NaN or empty whitespace
        missing_mask = df[target_uuid_col].isna() | df[target_uuid_col].astype(str).str.strip().eq("")
        missing_count = missing_mask.sum()
        total_rows = len(df)
        
        # --- B. Apply Flag Logic ---
        df['top_investor'] = df[target_uuid_col].apply(check_top_investor)
        
        # --- C. Stats ---
        top_count = df['top_investor'].sum()
        pct_top = (top_count / total_rows) * 100
        pct_missing = (missing_count / total_rows) * 100
        
        print(f"\n[{name}] Target: {target_uuid_col}")
        print(f"  - Total Rows:      {total_rows:,}")
        print(f"  - No Investors:    {missing_count:,} ({pct_missing:.1f}%)")
        print(f"  - Top Investors:   {top_count:,} ({pct_top:.1f}%)")
    else:
        # Pass silently or log if critical
        pass

# Update locals to ensure next cells use the updated DFs
locals().update(dfs)

Calculating 'top_investor' flag...

[df_global_pre_seed] Target: uuids_pre_seed
  - Total Rows:      9,943
  - No Investors:    2,753 (27.7%)
  - Top Investors:   2,168 (21.8%)

[df_global_seed] Target: uuids_seed
  - Total Rows:      65,230
  - No Investors:    18,877 (28.9%)
  - Top Investors:   10,507 (16.1%)

[df_global_series_a] Target: uuids_series_a
  - Total Rows:      29,197
  - No Investors:    3,481 (11.9%)
  - Top Investors:   4,771 (16.3%)

[df_global_seed_to_series_a] Target: uuids_seed
  - Total Rows:      13,527
  - No Investors:    2,440 (18.0%)
  - Top Investors:   3,087 (22.8%)

[df_global_angel] Target: uuids_angel
  - Total Rows:      17,072
  - No Investors:    6,100 (35.7%)
  - Top Investors:   600 (3.5%)

[df_uk_pre_seed] Target: uuids_pre_seed
  - Total Rows:      646
  - No Investors:    216 (33.4%)
  - Top Investors:   113 (17.5%)

[df_uk_seed] Target: uuids_seed
  - Total Rows:      4,500
  - No Investors:    1,120 (24.9%)
  - Top Investors:   779 (17.3%)

[

## Additional Funding Features

Derive additional, modeling-friendly funding covariates.

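Example: `first_funding_by` buckets the time-to-funding for each subset's round into coarse yearly bins (`1st year` through `5th year`, plus `5+ years`), which reduces sparsity and makes the timing feature easier to interpret.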
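
The yearly bucketing can also be expressed with `pd.cut`. The block below is a minimal sketch, not the notebook's implementation: the helper name `bucket_ttf_months` is hypothetical, and it assumes TTF is measured in months and that anything at or below 12 (including 0 or negative values) belongs to the first bucket.

```python
import numpy as np
import pandas as pd

# Hypothetical helper for illustration only.
def bucket_ttf_months(ttf: pd.Series) -> pd.Series:
    """Bucket time-to-funding (months) into ordered yearly categories."""
    labels = ["1st year", "2nd year", "3rd year", "4th year", "5th year", "5+ years"]
    # Right-closed intervals: (-inf, 12], (12, 24], ..., (60, inf)
    edges = [-np.inf, 12, 24, 36, 48, 60, np.inf]
    ttf = pd.to_numeric(ttf, errors="coerce")  # non-numeric values become NaN
    return pd.cut(ttf, bins=edges, labels=labels, right=True)

example = pd.Series([3, 12, 12.5, 61, None])
print(bucket_ttf_months(example).tolist())
```

`pd.cut` returns an ordered categorical directly and leaves missing inputs as missing, so no separate clean-up step is needed in this variant.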


In [31]:
# Purpose: Create `first_funding_by` buckets from TTF columns to stabilize timing features for modeling.

import pandas as pd
import numpy as np

# ==========================================
# 1. DEFINE THE FUNCTION
# ==========================================
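def add_first_funding_by(df: pd.DataFrame, ttf_col="ttf_months", out_col="first_funding_by") -> pd.DataFrame:
    """
    Buckets time-to-funding (months) into ordered yearly categories.
    Returns a copy of `df` with `out_col` added; rows with missing TTF get NA.
    """
    # Ensure numeric
    ttf = pd.to_numeric(df[ttf_col], errors="coerce")
    df = df.copy()
    
    # Logic: Bins are right-closed, e.g. (12, 24], so fractional values
    # (e.g. 12.5 months) land in exactly one bucket. Anything <= 12 is "1st year".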
    df[out_col] = np.select(
        [
            ttf <= 12,
            (ttf > 12) & (ttf <= 24),
            (ttf > 24) & (ttf <= 36),
            (ttf > 36) & (ttf <= 48),
            (ttf > 48) & (ttf <= 60),
        ],
        ["1st year", "2nd year", "3rd year", "4th year", "5th year"],
        default="5+ years" # Anything over 60 months (missing TTF is reset to NA below)
    )
    
    # Clean up: comparisons against NaN are all False, so rows with missing TTF
    # fall through to the default. Reset them to NA instead of '5+ years'.
    df.loc[ttf.isna(), out_col] = pd.NA
    
    # Convert to ordered categorical for proper sorting in charts
    cats = ["1st year", "2nd year", "3rd year", "4th year", "5th year", "5+ years"]
    df[out_col] = pd.Categorical(df[out_col], categories=cats, ordered=True)
    
    return df

# ==========================================
# 2. APPLY TO ALL DATAFRAMES
# ==========================================
print("Calculating 'first_funding_by' buckets...")

for name, df in dfs.items():
    
    # 1. Determine the correct TTF column based on DataFrame suffix
    target_ttf_col = None
    
    if name.endswith("_pre_seed"):
        target_ttf_col = "ttf_pre_seed_months"
        
    elif name.endswith("_angel"):
        target_ttf_col = "ttf_angel_months"
        
    elif name.endswith("_seed") or "seed_to_series_a" in name:
        # As requested: For Seed->A, we use the time to SEED.
        target_ttf_col = "ttf_seed_months"
        
    elif name.endswith("_series_a"):
        target_ttf_col = "ttf_series_a_months"
    
    # 2. Apply the function
    if target_ttf_col and target_ttf_col in df.columns:
        dfs[name] = add_first_funding_by(df, ttf_col=target_ttf_col)
        
        # 3. Print verification
        print(f"\n[{name}] Using column: {target_ttf_col}")
        if 'first_funding_by' in dfs[name].columns:
            # Print value counts to show distribution
            dist = dfs[name]['first_funding_by'].value_counts(dropna=True).sort_index()
            print(dist.to_string())
    else:
        print(f"\nSkipping {name}: Column {target_ttf_col} not found.")

# Update locals
locals().update(dfs)
print("\nProcessing complete.")

Calculating 'first_funding_by' buckets...

[df_global_pre_seed] Using column: ttf_pre_seed_months
first_funding_by
1st year    5117
2nd year    3047
3rd year    1779
4th year       0
5th year       0
5+ years       0

[df_global_seed] Using column: ttf_seed_months
first_funding_by
1st year    26387
2nd year    17415
3rd year    10470
4th year     6727
5th year     4231
5+ years        0

[df_global_series_a] Using column: ttf_series_a_months
first_funding_by
1st year    3679
2nd year    5573
3rd year    5667
4th year    5006
5th year    3929
5+ years    5343

[df_global_seed_to_series_a] Using column: ttf_seed_months
first_funding_by
1st year    5501
2nd year    3814
3rd year    2236
4th year    1271
5th year     705
5+ years       0

[df_global_angel] Using column: ttf_angel_months
first_funding_by
1st year    7244
2nd year    3843
3rd year    2200
4th year    1364
5th year     799
5+ years    1622

[df_uk_pre_seed] Using column: ttf_pre_seed_months
first_funding_by
1st year    332
2n

## Extract Founder Names (UK CB-only apples)

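Create a helper column `founder_names` by parsing the verbose `founders_descriptions` string into a pipe-delimited string of names (e.g., `John Doe | Jane Smith`). The code applies this to every round-level subset in `dfs` that contains `founders_descriptions`.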

This is mainly used for diagnostics / later matching, while keeping the original raw field intact.
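
A small worked example of the parsing rule, assuming the raw field follows the `Name: Role | Name: Role` convention. The helper `names_only` is a hypothetical compact stand-in written for this illustration.

```python
import pandas as pd

# Hypothetical compact helper; assumes "Name: Role" chunks separated by "|".
def names_only(raw):
    if pd.isna(raw) or not str(raw).strip():
        return pd.NA
    # Keep the text before the first colon in each chunk.
    names = [chunk.split(":", 1)[0].strip() for chunk in str(raw).split("|")]
    names = [n for n in names if n]
    return " | ".join(names) if names else pd.NA

sample = pd.Series(["John Doe: CEO | Jane Smith: CTO", "Solo Founder", "", None])
print(sample.apply(names_only).tolist())
```

Chunks without a colon are kept whole, and empty or missing inputs map to `pd.NA`, so the original raw column can stay untouched.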


In [32]:
# Purpose: Parse verbose founder strings into a clean `founder_names` helper column.

import pandas as pd

# ==========================================
# 1. DEFINE EXTRACTION LOGIC
# ==========================================
def extract_names(val):
    """
    Parses strings like "John Doe: CEO | Jane Smith: CTO" 
    into "John Doe | Jane Smith"
    """
    if pd.isna(val) or val == "":
        return pd.NA
        
    names = []
    # Split by pipe for multiple founders
    for chunk in str(val).split("|"):
        chunk = chunk.strip()
        # Split by colon to separate Name from Role (e.g. "Name: Role")
        # We take index [0] to get the Name
        name = chunk.split(":", 1)[0].strip() if ":" in chunk else chunk
        
        if name:
            names.append(name)
            
    return " | ".join(names) if names else pd.NA

# ==========================================
# 2. APPLY TO ALL DATAFRAMES
# ==========================================
print("Extracting founder names from 'founders_descriptions'...")

for name, df in dfs.items():
    
    # Check if the source column exists in this specific dataframe
    if "founders_descriptions" in df.columns:
        # Apply logic
        df["founder_names"] = df["founders_descriptions"].apply(extract_names)
        
        # Verification stats
        filled_count = df["founder_names"].notna().sum()
        print(f"  [{name}] Extracted {filled_count} founder names.")
        
    else:
        # Some subsets might not have this column if it wasn't in the original merge
        # print(f"  [{name}] Skipped (Column 'founders_descriptions' missing)")
        pass

# Update locals so you can access df_usa_seed['founder_names'] directly
locals().update(dfs)

print("\nProcessing complete.")

Extracting founder names from 'founders_descriptions'...
  [df_global_pre_seed] Extracted 6293 founder names.
  [df_global_seed] Extracted 42231 founder names.
  [df_global_series_a] Extracted 20745 founder names.
  [df_global_seed_to_series_a] Extracted 10744 founder names.
  [df_global_angel] Extracted 10204 founder names.
  [df_uk_pre_seed] Extracted 382 founder names.
  [df_uk_seed] Extracted 2679 founder names.
  [df_uk_series_a] Extracted 929 founder names.
  [df_uk_seed_to_series_a] Extracted 623 founder names.
  [df_uk_angel] Extracted 505 founder names.
  [df_uk_cb_only_pre_seed] Extracted 436 founder names.
  [df_uk_cb_only_seed] Extracted 2948 founder names.
  [df_uk_cb_only_series_a] Extracted 1009 founder names.
  [df_uk_cb_only_seed_to_series_a] Extracted 701 founder names.
  [df_uk_cb_only_angel] Extracted 567 founder names.
  [df_usa_pre_seed] Extracted 2758 founder names.
  [df_usa_seed] Extracted 21268 founder names.
  [df_usa_series_a] Extracted 10135 founder names.


## Timing Percentiles

Compute within-group timing percentiles to compare a company’s speed to funding relative to peers.

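Key idea:
- Compute percentiles within (`founding_cohort`, `sector`) buckets to control for structural differences across time and industries.
- Buckets with fewer than 10 valid observations are not ranked; their rows receive a neutral percentile of 0.5.
- Percentiles are then binned into quintiles labeled `fastest` through `slowest`.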
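
The within-group ranking can be sketched on toy data as follows. This is an illustration under stated assumptions: the rows are made up, and `MIN_GROUP_SIZE` is set to 3 only so the fallback is visible in a six-row frame (the notebook's code uses a threshold of 10).

```python
import pandas as pd

MIN_GROUP_SIZE = 3  # toy threshold for the demo; the notebook uses 10

toy = pd.DataFrame({
    "founding_cohort": ["2010-2013"] * 4 + ["2014-2017"] * 2,
    "sector": ["Software & SaaS"] * 4 + ["Other"] * 2,
    "ttf_months": [6.0, 12.0, 18.0, 24.0, 5.0, 30.0],
})

keys = ["founding_cohort", "sector"]
grouped = toy.groupby(keys, observed=True)["ttf_months"]

# Percentile rank within each group (ties share the average rank).
pct = grouped.rank(method="average", pct=True)
# Groups below the size threshold fall back to a neutral 0.5.
sizes = grouped.transform("size")
toy["ttf_percentile"] = pct.where(sizes >= MIN_GROUP_SIZE, 0.5)
print(toy)
```

The sketch assumes rows with missing TTF have already been excluded; otherwise a missing value in a small group would also receive 0.5.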


In [33]:
# Purpose: Compute TTF percentile features within (founding_cohort × sector) groups for each subset.

import pandas as pd
import numpy as np

# ==========================================
# 1. DEFINE FUNCTIONS (ADAPTED)
# ==========================================

def add_ttf_percentile(df: pd.DataFrame, ttf_col="ttf_months") -> pd.DataFrame:
    """
    Calculates percentile rank of TTF within (Cohort + Sector).
    Modified to accept dynamic 'ttf_col'.
    """
    # Check if the dynamic column exists
    if ttf_col not in df.columns:
        # If the specific TTF column is missing, return df as is (or raise error)
        return df

    required = ["founding_cohort", "sector"]
    missing = [col for col in required if col not in df.columns]
    if missing:
        # If structural columns are missing, we can't calculate cohorts
        return df 

    result = df.copy()
    
    # Ensure numeric
    result[ttf_col] = pd.to_numeric(result[ttf_col], errors='coerce')
    
    valid = result[ttf_col].notna()

    result["ttf_percentile"] = pd.NA
    valid_df = result.loc[valid].copy()

    if not valid_df.empty:
        # Calculate group sizes
        sizes = valid_df.groupby(["founding_cohort", "sector"], observed=True)[ttf_col].transform("size")
        
        small_idx = valid_df.index[sizes < 10]
        large_idx = valid_df.index[sizes >= 10]

        # Small groups get median (0.5)
        result.loc[small_idx, "ttf_percentile"] = 0.5

        # Large groups get ranked
        if len(large_idx) > 0:
            ranks = (
                result.loc[large_idx]
                .groupby(["founding_cohort", "sector"], observed=True)[ttf_col]
                .rank(method="average", pct=True)
            )
            result.loc[large_idx, "ttf_percentile"] = ranks

    result["ttf_percentile"] = result["ttf_percentile"].astype("Float64")
    return result


def add_ttf_percentile_bins(df, src="ttf_percentile", dest="ttf_percentile_binned"):
    if src not in df.columns:
        return df
        
    out = df.copy()
    pct = pd.to_numeric(out[src], errors="coerce")
    bins = [0, 0.20, 0.40, 0.60, 0.80, 1]
    labels = ["fastest", "faster", "typical", "slower", "slowest"]
    out[dest] = pd.cut(pct, bins=bins, labels=labels, include_lowest=True)
    return out


def add_time_to_1m_percentile_year(df: pd.DataFrame, min_size: int = 10) -> pd.DataFrame:
    # Check if target column exists
    if "time_to_1_mil" not in df.columns:
        return df
        
    required = ["founding_cohort", "sector"]
    missing = [col for col in required if col not in df.columns]
    if missing:
        return df

    out = df.copy()
    valid_mask = out["time_to_1_mil"].notna()

    pct = pd.Series(pd.NA, index=out.index, dtype="Float64")
    
    if valid_mask.any():
        grouped = out.loc[valid_mask].groupby(["founding_cohort", "sector"], observed=True)["time_to_1_mil"]

        for (cohort, sector), series in grouped:
            if len(series) < min_size:
                pct.loc[series.index] = 0.5
            else:
                pct.loc[series.index] = series.rank(method="average", pct=True)

    out["time_to_1_mil_percentile_year"] = pct
    return out


def add_time_to_1m_percentile_year_bins(df, src="time_to_1_mil_percentile_year", dest="time_to_1_mil_percentile_year_binned"):
    if src not in df.columns:
        return df
        
    out = df.copy()
    pct = pd.to_numeric(out[src], errors="coerce")
    bins = [0, 0.20, 0.40, 0.60, 0.80, 1]
    labels = ["fastest", "faster", "typical", "slower", "slowest"]
    out[dest] = pd.cut(pct, bins=bins, labels=labels, include_lowest=True)
    return out

# ==========================================
# 2. EXECUTION LOOP
# ==========================================
print("Calculating Percentiles and Bins...")

# Loop through the round-level subsets stored in `dfs`
for name, frame in dfs.items():
    
    # 1. Identify the Correct TTF Column based on dataframe name
    target_ttf_col = None
    
    if name.endswith("_pre_seed"):
        target_ttf_col = "ttf_pre_seed_months"
    elif name.endswith("_angel"):
        target_ttf_col = "ttf_angel_months"
    elif name.endswith("_seed") or "seed_to_series_a" in name:
        target_ttf_col = "ttf_seed_months"
    elif name.endswith("_series_a"):
        target_ttf_col = "ttf_series_a_months"
        
    # 2. Apply Functions
    # Pass the identified column name to the function
    if target_ttf_col:
        frame = add_ttf_percentile(frame, ttf_col=target_ttf_col)
        frame = add_ttf_percentile_bins(frame)
        
        # Apply Time to 1M logic (checks for existence inside function)
        frame = add_time_to_1m_percentile_year(frame)
        frame = add_time_to_1m_percentile_year_bins(frame)
        
        # Save back to dictionary
        dfs[name] = frame
        
        # Log
        print(f"[{name}] Processed using {target_ttf_col}. "
              f"Has percentile? {'ttf_percentile' in frame.columns}")
    else:
        print(f"[{name}] Skipped (No matching TTF column found)")

# Update globals/locals
locals().update(dfs)
print("\nProcessing complete.")

Calculating Percentiles and Bins...
[df_global_pre_seed] Processed using ttf_pre_seed_months. Has percentile? True
[df_global_seed] Processed using ttf_seed_months. Has percentile? True
[df_global_series_a] Processed using ttf_series_a_months. Has percentile? True
[df_global_seed_to_series_a] Processed using ttf_seed_months. Has percentile? True
[df_global_angel] Processed using ttf_angel_months. Has percentile? True
[df_uk_pre_seed] Processed using ttf_pre_seed_months. Has percentile? True
[df_uk_seed] Processed using ttf_seed_months. Has percentile? True
[df_uk_series_a] Processed using ttf_series_a_months. Has percentile? True
[df_uk_seed_to_series_a] Processed using ttf_seed_months. Has percentile? True
[df_uk_angel] Processed using ttf_angel_months. Has percentile? True
[df_uk_cb_only_pre_seed] Processed using ttf_pre_seed_months. Has percentile? True
[df_uk_cb_only_seed] Processed using ttf_seed_months. Has percentile? True
[df_uk_cb_only_series_a] Processed using ttf_series_a_mo

In [34]:
# Purpose: QA script to validate that percentile features are within [0, 1] and inspect anomalies.

import pandas as pd

# ==========================================
# QA / INSPECTION SCRIPT
# ==========================================
print("Running Quality Assurance (QA) on all datasets...")

# Columns that should contain numeric percentiles (0.0 to 1.0)
cols_to_check = [
    "ttf_percentile", 
    "time_to_1_mil_percentile_year"
]

for name, df in dfs.items():
    print(f"\n{'='*10} {name} {'='*10}")

    # 1. CHECK FOR OUT-OF-RANGE PERCENTILES
    # (Percentiles should lie within [0, 1])
    for col in cols_to_check:
        if col not in df.columns:
            # Silently skip if column doesn't exist (not all subsets have time_to_1_mil)
            continue
            
        # Convert to numeric (coerce errors) and flag values < 0 or > 1
        numeric_series = pd.to_numeric(df[col], errors="coerce")
        bad_mask = (numeric_series.lt(0) | numeric_series.gt(1)).fillna(False).astype(bool)
        bad_count = int(bad_mask.sum())
        
        if bad_count > 0:
            print(f"  [WARNING] {col}: {bad_count} values outside [0, 1] found.")
            # Display sample of bad rows
            cols_to_show = [c for c in [col, "founding_cohort", "sector", "uuid"] if c in df.columns]
            print(df.loc[bad_mask, cols_to_show].head())
        else:
            # Optional: Confirm it passed
            # print(f"  [PASS] {col}: No negatives.")
            pass

    # 2. CROSSTAB (Sample Size Check)
    # This helps verify if cohorts/sectors are large enough for ranking logic
    if {"founding_cohort", "sector"}.issubset(df.columns):
        print(f"  > Sample Sizes by Cohort/Sector (Top 5 rows):")
        
        # Create Crosstab
        size_ct = pd.crosstab(df["founding_cohort"], df["sector"], margins=True)
        
        # Print only the last 6 rows (the most recent cohorts plus the 'All'
        # margin) to avoid spamming the console. Adjust the .iloc limits to see more.
        print(size_ct.iloc[-6:, :].to_string()) 
    else:
        print("  [SKIP] Missing founding_cohort/sector columns.")

print("\nQA Complete.")

Running Quality Assurance (QA) on all datasets...

  > Sample Sizes by Cohort/Sector (Top 5 rows):
sector           Financial Services & Fintech  Health & Life Sciences  IT & Data Infrastructure  Marketing & Advertising  Media & Entertainment  Other  Professional & Business Services  Retail & E-Commerce  Software & SaaS   All
founding_cohort                                                                                                                                                                                                                     
2007-2009                                   9                      21                         5                        7                     19     48                                 3                    9               71   192
2010-2013                                  88                     115                       132                       49                    117    413                                27                   79         

## VC Aggregate Capital Raised

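Augment datasets with a macro environment proxy: total VC capital raised in the prior calendar year. Subsets whose name contains `uk` use the UK series; all other subsets (including `global`) use the US series.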

Produces:
- `vc_cap_raised`: prior-year macro VC raised relative to the funding event.
- `vc_cap_raised_prior_founding`: prior-year macro VC raised relative to the founding year.
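
The prior-year lookup can be sketched as follows. The two macro values are copied from the `us_vc_cap_raised` dictionary in this notebook; the dates are toy data chosen for illustration.

```python
import pandas as pd

# Two entries taken from us_vc_cap_raised; everything else here is toy data.
macro = {2019: 56.43, 2020: 86.63}

toy = pd.DataFrame({"date_seed": ["2020-03-15", "2021-07-01", "2030-01-01", None]})

# Year of the event, then shift back one year to get the lagged macro value.
event_year = pd.to_datetime(toy["date_seed"], errors="coerce").dt.year
toy["vc_cap_raised"] = (event_year - 1).map(macro)
print(toy)
```

Rows with a missing date, or with a lagged year that is absent from the mapping, come out as NaN rather than raising, so coverage of the mapping is worth checking after the merge.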


In [35]:
# Purpose: Map macro VC capital raised (prior year) to each company by funding-year and by founding-year.

import pandas as pd
import numpy as np

# ==========================================
# 1. DEFINE MACRO DATA (Unchanged)
# ==========================================
us_vc_cap_raised = {
    2024: 72.86, 2023: 74.02, 2022: 146.38, 2021: 161.89, 2020: 86.63,
    2019: 56.43, 2018: 48.36, 2017: 35.54, 2016: 40.43, 2015: 35.44,
    2014: 32.75, 2013: 18.85, 2012: 27.86, 2011: 17.39, 2010: 14.85,
    2009: 16.16, 2008: 30.69, 2007: 27.36, 2006: 29.17, 2005: 23.50,
    2004: 20.27, 2003: 9.44, 2002: 11.22, 2001: 35.27, 2000: 59.97,
    1999: 45.00, 1998: 20.00, # Added some historical buffers just in case
}

uk_vc_cap_raised = {
    2025: 1.95, 2024: 5.69, 2023: 5.12, 2022: 7.36, 2021: 8.72,
    2020: 7.48, 2019: 3.79, 2018: 2.94, 2017: 4.19, 2016: 2.84,
    2015: 2.07, 2014: 2.90, 2013: 1.00, 2012: 0.81, 2011: 0.87,
    2010: 0.85, 2009: 1.54, 2008: 0.50, 2007: 2.27, 2006: 2.35,
    2005: 1.01, 2004: 0.79, 2003: 0.85, 2002: 0.66, 2001: 1.97,
    2000: 2.39,
}

# ==========================================
# 2. DEFINE LOGIC (UPDATED: Flexible Target)
# ==========================================

def add_vc_cap_raised_lag(df, mapping, source_date_col, target_col_name):
    """
    Maps each row to the VC capital raised in the calendar year before source_date_col.
    Rows with unparseable dates, or lagged years absent from `mapping`, get NaN.
    """
    df = df.copy()
    
    # 1. Calculate Lagged Year
    # Subtract 1 because we want the capital raised in the PREVIOUS year
    years_minus_one = pd.to_datetime(df[source_date_col], errors="coerce").dt.year - 1
    
    # 2. Map Raw Values
    df[target_col_name] = years_minus_one.map(mapping)
    
    return df

# ==========================================
# 3. APPLY TO ALL DATAFRAMES
# ==========================================
print("Mapping Macro VC Capital Data (Prior to Round & Prior to Founding)...")

for name, df in dfs.items():
    
    # --- A. Identify Correct Macro Dataset ---
    target_mapping = None
    mapping_name = ""
    
    if "uk" in name:
        target_mapping = uk_vc_cap_raised
        mapping_name = "UK Data"
    else:
        target_mapping = us_vc_cap_raised
        mapping_name = "US Data"

    # --- B. Identify Correct Date Columns ---
    # 1. Funding Round Date (for 'vc_cap_raised')
    funding_date_col = None
    if name.endswith("_pre_seed"):
        funding_date_col = "date_pre_seed"
    elif name.endswith("_angel"):
        funding_date_col = "date_angel"
    elif name.endswith("_seed") or "seed_to_series_a" in name:
        funding_date_col = "date_seed"
    elif name.endswith("_series_a"):
        funding_date_col = "date_series_a"
        
    # 2. Founding Date (for 'vc_cap_raised_prior_founding')
    founding_date_col = "founded_on"

    # --- C. Apply Mappings ---
    
    # Apply 1: Prior to Funding Event
    if funding_date_col and funding_date_col in df.columns:
        dfs[name] = add_vc_cap_raised_lag(
            df=dfs[name], 
            mapping=target_mapping, 
            source_date_col=funding_date_col,
            target_col_name="vc_cap_raised"
        )
        filled_1 = dfs[name]["vc_cap_raised"].notna().sum()
        print(f"[{name}] Event Macro ({funding_date_col}): Mapped {filled_1} rows.")
    else:
        print(f"[{name}] Event Macro: Skipped (Date col missing)")

    # Apply 2: Prior to Founding (NEW)
    if founding_date_col in df.columns:
        dfs[name] = add_vc_cap_raised_lag(
            df=dfs[name],
            mapping=target_mapping,
            source_date_col=founding_date_col,
            target_col_name="vc_cap_raised_prior_founding"
        )
        filled_2 = dfs[name]["vc_cap_raised_prior_founding"].notna().sum()
        print(f"[{name}] Founding Macro: Mapped {filled_2} rows.")
    else:
        print(f"[{name}] Founding Macro: Skipped ('founded_on' missing)")

# Update locals
locals().update(dfs)
print("\nProcessing complete.")

Mapping Macro VC Capital Data (Prior to Round & Prior to Founding)...
[df_global_pre_seed] Event Macro (date_pre_seed): Mapped 9943 rows.
[df_global_pre_seed] Founding Macro: Mapped 9943 rows.
[df_global_seed] Event Macro (date_seed): Mapped 65230 rows.
[df_global_seed] Founding Macro: Mapped 65230 rows.
[df_global_series_a] Event Macro (date_series_a): Mapped 29197 rows.
[df_global_series_a] Founding Macro: Mapped 29197 rows.
[df_global_seed_to_series_a] Event Macro (date_seed): Mapped 13527 rows.
[df_global_seed_to_series_a] Founding Macro: Mapped 13527 rows.
[df_global_angel] Event Macro (date_angel): Mapped 17072 rows.
[df_global_angel] Founding Macro: Mapped 17072 rows.
[df_uk_pre_seed] Event Macro (date_pre_seed): Mapped 646 rows.
[df_uk_pre_seed] Founding Macro: Mapped 646 rows.
[df_uk_seed] Event Macro (date_seed): Mapped 4500 rows.
[df_uk_seed] Founding Macro: Mapped 4500 rows.
[df_uk_series_a] Event Macro (date_series_a): Mapped 1269 rows.
[df_uk_series_a] Founding Macro: Map

## Second Filter

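Apply an additional closure-consistency filter at the *round level*.

Goal: drop companies that closed on or before the relevant round date (e.g., closed before Seed for Seed cohorts) or that are marked closed without a closing date. The same pass also drops rows missing required fields (status, sector, founding cohort, founder education, round date, round amount, TTF, investor type, `top_investor`).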
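
The closure rule can be sketched on toy data as follows. The rows are made up for illustration; the column names follow the Seed subset.

```python
import pandas as pd

toy = pd.DataFrame({
    "status": ["operating", "closed", "closed", "closed"],
    "closed_on": [None, "2019-01-01", "2022-06-30", None],
    "date_seed": ["2020-05-01", "2020-05-01", "2020-05-01", "2020-05-01"],
})

closed_dt = pd.to_datetime(toy["closed_on"], errors="coerce")
round_dt = pd.to_datetime(toy["date_seed"], errors="coerce")
is_closed = toy["status"].str.strip().str.lower().eq("closed")

# Inconsistency 1: marked closed, but no usable closing date.
closed_without_date = is_closed & closed_dt.isna()
# Inconsistency 2: closing date falls on or before the round date.
closed_by_round = closed_dt.notna() & round_dt.notna() & (closed_dt <= round_dt)

kept = toy.loc[~(closed_without_date | closed_by_round)]
print(kept)
```

Only the operating company and the company that closed after its Seed round survive the filter in this toy frame.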


In [36]:
# Purpose: Apply a round-specific closure-consistency filter (drop firms closed before the relevant round date).

import pandas as pd
import numpy as np

# ==========================================
# 1. SETUP & DEFINITIONS
# ==========================================

# Helper to identify the relevant columns for a specific dataframe
def get_target_cols(df_name):
    if df_name.endswith("_pre_seed"):
        return "date_pre_seed", "amount_pre_seed", "ttf_pre_seed_months"
    elif df_name.endswith("_angel"):
        return "date_angel", "amount_angel", "ttf_angel_months"
    elif df_name.endswith("_seed") or "seed_to_series_a" in df_name:
        return "date_seed", "amount_seed", "ttf_seed_months"
    elif df_name.endswith("_series_a"):
        return "date_series_a", "amount_series_a", "ttf_series_a_months"
    return None, None, None

# ==========================================
# 2. LOGIC
# ==========================================

def _invalid_closed(frame: pd.DataFrame, funding_col: str) -> pd.Series:
    """Checks if company closed ON OR BEFORE the specific funding round."""
    closed_dt = pd.to_datetime(frame["closed_on"], errors="coerce")
    funding_dt = pd.to_datetime(frame[funding_col], errors="coerce")
    
    status_closed = frame["status"].astype(str).str.strip().str.lower().eq("closed")
    
    # Error 1: Status is closed but no parseable closed date
    missing_closed = status_closed & closed_dt.isna()
    
    # Error 2: Closed date is on or before the funding date (Time Travel)
    closed_before_funding = (
        closed_dt.notna() & funding_dt.notna() & (closed_dt <= funding_dt)
    )
    
    return missing_closed | closed_before_funding

def build_apples(df: pd.DataFrame, name: str) -> pd.DataFrame:
    # 1. Determine dynamic columns
    date_col, amt_col, ttf_col = get_target_cols(name)
    
    if not date_col:
        print(f"Skipping {name}: Could not determine target columns.")
        return df

    # 2. Define Base Requirements
    required_cols = [
        "status",
        "sector",
        "founding_cohort",
        "closed_on",
        "founder_education",
        "investor_type",
        "top_investor",
        date_col,
        amt_col,
        ttf_col
    ]
    
    missing = set(required_cols) - set(df.columns)
    critical_missing = sorted(missing)
    if critical_missing:
        print(f"[{name}] CRITICAL: Missing columns {critical_missing}")
        return df

    # 3. Define Filters
    # Each lambda returns True if the row should be REMOVED (The "Bad" Mask)
    filter_steps = {
        # Metadata Filters
        "status": lambda f: f["status"].isna() | f["status"].astype(str).str.strip().eq(""),
        "sector": lambda f: f["sector"].isna() | f["sector"].astype(str).str.strip().eq(""),
        "founding_cohort": lambda f: f["founding_cohort"].isna(),
        "founder_education": lambda f: f["founder_education"].isna() | f["founder_education"].astype(str).str.strip().eq(""),
        
        # Funding & Date Filters
        date_col: lambda f: f[date_col].isna(),
        "closed_on": lambda f: _invalid_closed(f, date_col),
        
        # TTF: remove only missing values (NaN).
        # Rows with TTF > 60 months are kept.
        ttf_col: lambda f: f[ttf_col].isna(),
        
        # Amount must be non-empty
        amt_col: lambda f: f[amt_col].isna() | f[amt_col].astype(str).str.strip().eq(""),
        
        # Investor type: remove only missing values (NaN).
        # 'INSTITUTIONAL' and 'PRIVATE_EQUITY' rows are kept.
        "investor_type": lambda f: f["investor_type"].isna(),
        
        # Top Investor (Must exist)
        "top_investor": lambda f: f["top_investor"].isna()
    }

    # 4. Execute Filters
    kept = df.copy()
    initial_count = len(kept)
    removed_by_col = {}

    for col, filter_func in filter_steps.items():
        # Apply mask
        mask_to_remove = filter_func(kept)
        count_removed = mask_to_remove.sum()
        
        removed_by_col[col] = count_removed
        
        # Keep the good ones
        kept = kept.loc[~mask_to_remove].copy()

    # 5. Report
    total_removed = initial_count - len(kept)
    
    details = ", ".join(f"{k}={v}" for k, v in removed_by_col.items() if v > 0)
    print(f"[{name}] Started: {initial_count} -> Kept: {len(kept)} (Dropped {total_removed})")
    if details:
        print(f"   Details: {details}")
    print("-" * 60)
    
    return kept

# ==========================================
# 3. APPLY TO ALL DFS (IN PLACE)
# ==========================================

print("Applying Filters to Datasets (In-Place)...")

for key, frame in dfs.items():
    # Overwrite the dataframe in the dictionary directly
    dfs[key] = build_apples(frame, key)

# Update locals
locals().update(dfs)

print("\nProcessing complete. Original DataFrames have been updated.")

Applying Filters to Datasets (In-Place)...
[df_global_pre_seed] Started: 9943 -> Kept: 9442 (Dropped 501)
   Details: closed_on=501
------------------------------------------------------------
[df_global_seed] Started: 65230 -> Kept: 59611 (Dropped 5619)
   Details: closed_on=5619
------------------------------------------------------------
[df_global_series_a] Started: 29197 -> Kept: 28111 (Dropped 1086)
   Details: closed_on=1086
------------------------------------------------------------
[df_global_seed_to_series_a] Started: 13527 -> Kept: 13255 (Dropped 272)
   Details: closed_on=272
------------------------------------------------------------
[df_global_angel] Started: 17072 -> Kept: 15800 (Dropped 1272)
   Details: closed_on=1272
------------------------------------------------------------
[df_uk_pre_seed] Started: 646 -> Kept: 606 (Dropped 40)
   Details: founding_cohort=11, closed_on=29
------------------------------------------------------------
[df_uk_seed] Started: 4500 -> 

## Normalize Missing Categorical Labels

(Disabled / optional)

Standardize key categorical covariates by mapping empty strings to missing values (or a sentinel like `Unobserved`) before exporting.

The code is kept commented-out because the final choice of missing-value treatment is a modeling decision.


In [37]:

# Purpose: (Commented-out) Optional normalization of missing categorical labels prior to export.

# targets = [
#     "df_global_apples",
#     "df_uk_apples",
#     "df_uk_cb_only_apples",
#     "df_usa_apples",
# ]
# cols = ["founder_uni_reputation", "investor_type", "founding_team_diversity", "founding_team_size", "founder_education"]

# for name in targets:
#     df = globals().get(name)
#     if df is None:
#         print(f"Skipping {name}: not found")
#         continue

#     df = df.copy()
#     for col in cols:
#         if col not in df.columns:
#             print(f"{name}: column '{col}' missing")
#             continue
#         df[col] = (
#             df[col]
#             .astype("string")
#             .str.strip()
#             .replace("", pd.NA)
#             .replace("NaN", pd.NA)
#             .fillna("Unobserved")
#         )

#     globals()[name] = df

#     for col in cols:
#         if col in df.columns:
#             print(f"{name} | {col}: Unobserved count = {(df[col] == 'Unobserved').sum()}")


## Cleanup, Ordering, and Types

Final housekeeping before export:
- Select and order a consistent column set.
- Enforce dtypes for dates, categoricals, and numerics.
- Run strict chronology checks on stage dates (drop logically impossible orderings).
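
A chronology check of this kind might look like the sketch below. The stage order in `STAGE_ORDER` and the choice to allow same-day rounds are assumptions made for illustration; the notebook's own rule may differ.

```python
import pandas as pd

# Assumed stage order for illustration; the notebook's own rule may differ.
STAGE_ORDER = ["date_seed", "date_series_a", "date_series_b"]

toy = pd.DataFrame({
    "date_seed": ["2018-01-01", "2020-01-01", None],
    "date_series_a": ["2019-06-01", "2019-01-01", "2021-03-01"],
    "date_series_b": [None, "2021-01-01", "2020-01-01"],
})

dates = toy[STAGE_ORDER].apply(pd.to_datetime, errors="coerce")

def is_chronological(row: pd.Series) -> bool:
    # Compare only the stages that were actually observed.
    observed = row.dropna()
    return bool(observed.is_monotonic_increasing)

ok = dates.apply(is_chronological, axis=1)
kept = toy.loc[ok]
print(kept)
```

Missing stage dates are skipped rather than treated as violations, so a company with only a Seed and a Series A date is checked on that pair alone.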


In [38]:
# Purpose: Define a consistent ordered column set for export across all round-level subsets.

import pandas as pd

# ==========================================
# 1. DEFINE COLUMN GROUPS
# ==========================================

# Core Metadata
CORE_COLS = [
    "org_uuid", "org_name", "legal_name", "homepage_url",
    "org_country", "org_city", "sector",
    "status",
    "founded_on", "closed_on", "went_public_on", "acquired_on_first",
    "founding_cohort", "founding_team_size", "founding_team_diversity",
    "founder_education", "founder_uni_reputation", "founder_names",
    "employee_count", "vc_cap_raised", "vc_cap_raised_raw", "prior_founding_experience", "vc_cap_raised_prior_founding"
]

# Analysis Metrics (Funding & Time)
# UPDATE: Added potential log versions for time_to_1_mil just in case
METRICS_COLS = [
    "total_funding_usd", "num_funding_rounds",
    "investor_type", "top_investor",
    "time_to_1_mil", "time_to_1_mil_raw", 
    "time_to_1_mil_log", "time_to_1_mil_log_raw", # Added these
    "ttf_percentile", "ttf_percentile_binned",
    "first_funding_by", 
]

# --- ALL STAGE COLUMNS (Keep in every DF to preserve history) ---
ALL_STAGE_COLS = [
    # Pre-Seed
    "date_pre_seed", "amount_pre_seed", "uuids_pre_seed",
    # Angel
    "date_angel", "amount_angel", "uuids_angel",
    # Seed
    "date_seed", "amount_seed", "uuids_seed",
    # Series A
    "date_series_a", "amount_series_a", "uuids_series_a",
    # Series B
    "date_series_b", "amount_series_b", "uuids_series_b",
    # Series C
    "date_series_c", "amount_series_c", "uuids_series_c"
]

# --- ALL LOGICAL ROUNDS (Keep in every DF to preserve history) ---
LOGICAL_ROUND_COLS = []
for i in range(1, 11): # Generates logical_round_1 to 10
    LOGICAL_ROUND_COLS.extend([
        f"logical_round_{i}",
        f"logical_round_{i}_date",
        f"logical_round_{i}_amount",
        f"logical_round_{i}_uuids"
    ])

# ==========================================
# 2. DYNAMIC REORDERING FUNCTION
# ==========================================

def reorder_dataframe(df, name):
    # 1. Identify Subset-Specific TTF Columns
    # We prioritize the TTF column relevant to this specific subset
    # so it appears early in the dataframe
    subset_ttf_cols = []
    
    # UPDATE: Added _log and _log_raw variants to all blocks below
    if "_pre_seed" in name:
        subset_ttf_cols = [
            "ttf_pre_seed_months", "ttf_pre_seed_months_raw",
            "ttf_pre_seed_months_log", "ttf_pre_seed_months_log_raw"
        ]
        
    elif "_angel" in name:
        subset_ttf_cols = [
            "ttf_angel_months", "ttf_angel_months_raw",
            "ttf_angel_months_log", "ttf_angel_months_log_raw"
        ]
        
    elif "_seed" in name:  # Matches _seed and _seed_to_series_a; _pre_seed names are caught by the earlier branch
        subset_ttf_cols = [
            "ttf_seed_months", "ttf_seed_months_raw",
            "ttf_seed_months_log", "ttf_seed_months_log_raw"
        ]
        
        if "seed_to_series_a" in name:
            subset_ttf_cols.extend([
                "months_seed_to_series_a", "months_seed_to_series_a_raw",
                "months_seed_to_series_a_log", "months_seed_to_series_a_log_raw"
            ])
            
    elif "_series_a" in name:
        subset_ttf_cols = [
            "ttf_series_a_months", "ttf_series_a_months_raw",
            "ttf_series_a_months_log", "ttf_series_a_months_log_raw"
        ]

    # 2. Construct Final Order preference
    # Priority: 
    #   1. Core Metadata
    #   2. Subset-Specific TTF (The most important time metric for this file)
    #   3. Analysis Metrics (Total funding, Investor Type, Time to 1M)
    #   4. All Specific Stages (Pre-Seed -> Series C)
    #   5. All Logical Rounds (Full history 1-10)
    
    final_order = (
        CORE_COLS + 
        subset_ttf_cols + 
        METRICS_COLS + 
        ALL_STAGE_COLS + 
        LOGICAL_ROUND_COLS
    )
    
    # 3. Filter: Only keep columns that actually exist in this dataframe
    # (This prevents crashes if a subset somehow missed a column)
    # NOTE: columns that are not listed in final_order are dropped from the result.
    existing_cols = [c for c in final_order if c in df.columns]
    
    return df[existing_cols]

# ==========================================
# 3. EXECUTE CLEANUP & REORDERING
# ==========================================
print("Cleaning and Reordering columns...")

# List of junk columns to drop before reordering
cols_to_drop_junk = [
    "parent_backed", "parent_name", "region", "founders_countries",
    "founders_female_count", "founders_male_count", "founders_descriptions.1",
    "founders_has_phd", "founders_has_mba", "founders_has_masters", 
    "founders_has_bachelors", "founders_degrees", "category_list", 
    "category_groups_list", "parent_uuid", "found_year", 
    "ch_match_score", "ch_match_source", "ch_last_nm01_date", "ch_last_cs01_date",
    "ch_match_reason", "ch_last_sh01_date", "ch_company_status", "ch_company_name",
    "ch_date_of_creation", "ch_link_officers", "ch_search_query", "ch_company_number",
    "ch_link_filing_history", "ch_search_url", "ch_date_of_cessation"
]

for name, df in dfs.items():
    # 1. Drop junk columns to save memory and clean view
    dfs[name] = df.drop(columns=cols_to_drop_junk, errors="ignore")
    
    # 2. Reorder columns
    dfs[name] = reorder_dataframe(dfs[name], name)
    
    print(f"[{name}] Final Shape: {dfs[name].shape}")

# Expose each subset as a top-level notebook variable (e.g. df_usa_seed)
globals().update(dfs)
print("\nProcessing Complete.")

# ==========================================
# 4. FINAL VALIDATION
# ==========================================
# Check if we successfully kept history columns
if "df_usa_seed" in dfs:
    print("\n--- Validation on 'df_usa_seed' ---")
    cols = dfs["df_usa_seed"].columns
    
    # Check for Series B (Future history)
    has_b = "date_series_b" in cols
    # Check for Logical Round 10 (Deep history)
    has_log10 = "logical_round_10" in cols
    # Check for Investor Type (New Metric)
    has_inv = "investor_type" in cols
    # Check for Log columns (New Metric)
    has_log = "ttf_seed_months_log" in cols
    
    print(f"Has Series B column? {has_b}")
    print(f"Has Logical Round 10? {has_log10}")
    print(f"Has Investor Type?    {has_inv}")
    print(f"Has TTF Log?          {has_log}")

Cleaning and Reordering columns...
[df_global_pre_seed] Final Shape: (9442, 89)
[df_global_seed] Final Shape: (59611, 89)
[df_global_series_a] Final Shape: (28111, 89)
[df_global_seed_to_series_a] Final Shape: (13255, 90)
[df_global_angel] Final Shape: (15800, 89)
[df_uk_pre_seed] Final Shape: (606, 89)
[df_uk_seed] Final Shape: (4199, 89)
[df_uk_series_a] Final Shape: (1233, 89)
[df_uk_seed_to_series_a] Final Shape: (795, 90)
[df_uk_angel] Final Shape: (723, 89)
[df_uk_cb_only_pre_seed] Final Shape: (671, 89)
[df_uk_cb_only_seed] Final Shape: (4504, 89)
[df_uk_cb_only_series_a] Final Shape: (1329, 89)
[df_uk_cb_only_seed_to_series_a] Final Shape: (881, 90)
[df_uk_cb_only_angel] Final Shape: (796, 89)
[df_usa_pre_seed] Final Shape: (3471, 89)
[df_usa_seed] Final Shape: (26277, 89)
[df_usa_series_a] Final Shape: (11670, 89)
[df_usa_seed_to_series_a] Final Shape: (6495, 90)
[df_usa_angel] Final Shape: (3573, 89)

Processing Complete.

--- Validation on 'df_usa_seed' ---
Has Series B colu

In [39]:
# Purpose: Enforce dtype consistency (dates, ints, floats, booleans, strings, categoricals) across all subsets.

import pandas as pd
import numpy as np

# ==========================================
# 1. DEFINE EXPANDED COLUMN LISTS
# ==========================================

date_cols = [
    "founded_on", 
    "closed_on", "went_public_on", "acquired_on_first",
    # New Stage Dates
    "date_pre_seed", "date_angel", "date_seed", 
    "date_series_a", "date_series_b", "date_series_c"
]

int_cols = [
    "prior_founding_experience",
    "num_funding_rounds"
]

float_cols = [
    "ttf_days", "ttf_months", "time_to_1_mil", "total_funding_usd",
    "vc_cap_raised",
    # New TTF Metrics
    "ttf_pre_seed_months", "ttf_pre_seed_months_raw", "ttf_pre_seed_months_log",
    "ttf_angel_months", "ttf_angel_months_raw", "ttf_angel_months_log",
    "ttf_seed_months", "ttf_seed_months_raw", "ttf_seed_months_log",
    "ttf_series_a_months", "ttf_series_a_months_raw", "ttf_series_a_months_log",
    "months_seed_to_series_a", "months_seed_to_series_a_raw", "months_seed_to_series_a_log",
    "time_to_1_mil_raw", "ttf_percentile", "vc_cap_raised_raw", "vc_cap_raised_prior_founding"
]

bool_cols = [
    "top_investor" # Moved from int to bool based on previous logic
]

# Generate logical round columns (strings)
logical_cols = []
for i in range(1, 11): 
    logical_cols.extend([
        f"logical_round_{i}", f"logical_round_{i}_date",
        f"logical_round_{i}_amount", f"logical_round_{i}_uuids"
    ])

string_cols = [
    "org_uuid", "org_name", "legal_name", "homepage_url",
    "org_country", "org_city", "founder_names",
    "first_round_investor_uuid", "first_funding_leads",
    # ID and Amount Columns (Amounts often contain currency text)
    "uuids_pre_seed", "amount_pre_seed",
    "uuids_angel", "amount_angel",
    "uuids_seed", "amount_seed",
    "uuids_series_a", "amount_series_a",
    "uuids_series_b", "amount_series_b",
    "uuids_series_c", "amount_series_c",
] + logical_cols

cat_cols = [
    "status", "founding_cohort", "founding_team_diversity",
    "founder_education", "founder_uni_reputation", "sector",
    "investor_type", # New calculated column
    "first_funding_by", "ttf_percentile_binned", "ttf_percentile_year_binned",
    "time_to_1_mil_percentile_year_binned",
    "founding_team_size" # Treated as categorical here; move to int_cols if a numeric count is needed
]

# ==========================================
# 2. DEFINE CASTING FUNCTION
# ==========================================

def set_types(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    
    # Dates
    for col in date_cols:
        if col in df.columns:
            df[col] = pd.to_datetime(df[col], errors="coerce")
            
    # Integers (Nullable)
    for col in int_cols:
        if col in df.columns:
            df[col] = pd.to_numeric(df[col], errors="coerce").astype("Int64")
            
    # Floats
    for col in float_cols:
        if col in df.columns:
            df[col] = pd.to_numeric(df[col], errors="coerce")
            
    # Booleans
    for col in bool_cols:
        if col in df.columns:
            df[col] = df[col].astype("boolean")
            
    # Strings
    for col in string_cols:
        if col in df.columns:
            df[col] = df[col].astype("string")
            
    # Categories
    for col in cat_cols:
        if col in df.columns:
            df[col] = df[col].astype("category")
            
    return df

# ==========================================
# 3. APPLY TO DATAFRAMES
# ==========================================
print("Setting Data Types...")

for name, df in dfs.items():
    # Apply type casting
    dfs[name] = set_types(df)
    print(f"Typed {name}: {len(dfs[name])} rows")

# Expose each subset as a top-level notebook variable (e.g. df_usa_seed)
globals().update(dfs)
print("\nType setting complete.")

Setting Data Types...
Typed df_global_pre_seed: 9442 rows
Typed df_global_seed: 59611 rows
Typed df_global_series_a: 28111 rows
Typed df_global_seed_to_series_a: 13255 rows
Typed df_global_angel: 15800 rows
Typed df_uk_pre_seed: 606 rows
Typed df_uk_seed: 4199 rows
Typed df_uk_series_a: 1233 rows
Typed df_uk_seed_to_series_a: 795 rows
Typed df_uk_angel: 723 rows
Typed df_uk_cb_only_pre_seed: 671 rows
Typed df_uk_cb_only_seed: 4504 rows
Typed df_uk_cb_only_series_a: 1329 rows
Typed df_uk_cb_only_seed_to_series_a: 881 rows
Typed df_uk_cb_only_angel: 796 rows
Typed df_usa_pre_seed: 3471 rows
Typed df_usa_seed: 26277 rows
Typed df_usa_series_a: 11670 rows
Typed df_usa_seed_to_series_a: 6495 rows
Typed df_usa_angel: 3573 rows

Type setting complete.
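
To make the effect of the casting rules concrete, here is a small sketch on hypothetical values (the frame `raw` and its entries are invented for illustration). It shows how `errors="coerce"` turns unparseable entries into missing values, and why the nullable `Int64` and `boolean` dtypes are convenient when a column contains gaps.

```python
import pandas as pd

# Hypothetical toy values; "n/a" and "not a date" stand in for any unparseable entry.
raw = pd.DataFrame({
    "num_funding_rounds": ["3", "n/a", None],
    "top_investor": [True, False, None],
    "date_seed": ["2021-05-01", "not a date", None],
})

typed = raw.copy()

# errors="coerce" converts unparseable values to missing instead of raising.
typed["num_funding_rounds"] = (
    pd.to_numeric(typed["num_funding_rounds"], errors="coerce").astype("Int64")
)
# The nullable boolean dtype keeps missing values as <NA> rather than forcing object dtype.
typed["top_investor"] = typed["top_investor"].astype("boolean")
typed["date_seed"] = pd.to_datetime(typed["date_seed"], errors="coerce")

print(typed.dtypes)
print(typed.isna().sum())
```

One consequence worth keeping in mind: coercion is silent, so comparing missing-value counts before and after casting is a cheap way to spot columns where many entries failed to parse.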


In [40]:
# Purpose: Drop observations with logically impossible stage-date ordering after coercing to datetime.

import pandas as pd

# ==========================================
# STRICT CHRONOLOGY FILTERING (ROBUST)
# ==========================================

print("Applying STRICT chronology sanity checks...")

# 1. Define the logical maturity order
# A stage earlier in this list that is dated later than a stage further down the list is an error.
stage_order = [
    'date_pre_seed',
    'date_seed',
    'date_series_a',
    'date_series_b',
    'date_series_c'
]

total_dropped_global = 0

for name, df in dfs.items():
    
    # --- A. Force Datetime Conversion (Crucial Fix) ---
    # We ensure all stage columns are actual Datetimes, not Strings.
    # If they are strings, "10/01/2023" > "01/01/2024" might evaluate incorrectly.
    for col in stage_order:
        if col in df.columns:
            df[col] = pd.to_datetime(df[col], errors='coerce')

    initial_count = len(df)
    mask_drop = pd.Series(False, index=df.index)
    
    # --- B. Brute Force Comparison ---
    # Compare every early stage against every late stage
    for i in range(len(stage_order)):
        early_col = stage_order[i]
        
        for j in range(i + 1, len(stage_order)):
            late_col = stage_order[j]
            
            # We can only check if BOTH columns exist in this dataframe
            if early_col in df.columns and late_col in df.columns:
                
                # Logic: Drop if Early Date > Late Date
                # (e.g. If Seed Date > Series A Date)
                bad_rows = (
                    df[early_col].notna() & 
                    df[late_col].notna() & 
                    (df[early_col] > df[late_col])
                )
                
                count = bad_rows.sum()
                if count > 0:
                    print(f"  [{name}] Found {count} errors: {early_col} is after {late_col}")
                    mask_drop = mask_drop | bad_rows

    # --- C. Apply Drop ---
    if mask_drop.any():
        df_clean = df[~mask_drop].copy()
        dropped = initial_count - len(df_clean)
        
        # Update the dictionary
        dfs[name] = df_clean
        total_dropped_global += dropped
        
        print(f"  -> Dropped {dropped} rows from {name}. Remaining: {len(df_clean)}")
    # else:
    #     print(f"  [{name}] Clean.")

# Expose each subset as a top-level notebook variable (e.g. df_usa_seed)
globals().update(dfs)

print(f"\nTotal rows dropped across all datasets: {total_dropped_global}")

Applying STRICT chronology sanity checks...

Total rows dropped across all datasets: 0
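
Since the run above reports zero dropped rows, a toy example may help confirm what the check would catch. The sketch below uses hypothetical dates in a frame called `toy`; it covers a single stage pair rather than the full pairwise loop, and it also shows the string-comparison pitfall mentioned in the cell's comments.

```python
import pandas as pd

# Hypothetical toy rows; only the second row has Seed dated after Series A.
toy = pd.DataFrame({
    "date_seed": ["2019-03-01", "2022-06-01", "2020-01-15"],
    "date_series_a": ["2020-09-01", "2021-01-01", None],
})

for col in ["date_seed", "date_series_a"]:
    toy[col] = pd.to_datetime(toy[col], errors="coerce")

# Comparisons against NaT evaluate to False, so rows with a missing stage are never flagged.
bad = toy["date_seed"] > toy["date_series_a"]
clean = toy[~bad].copy()

print(bad.tolist())
print(len(clean))

# Lexicographic comparison of date strings can disagree with chronology.
print("10/01/2023" > "01/01/2024")
```

The third row survives because its Series A date is missing, which matches the behavior of the explicit `notna()` guards in the cell above.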


## Export Feature Tables

Write each dataframe in `dfs` to a separate CSV file for downstream analysis.

Filenames follow the dataframe key (e.g., `df_usa_seed.csv`).
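
CSV files do not store dtype information, so the types enforced earlier are not preserved on disk. A downstream reader could restore them roughly as sketched below. This is an illustration under stated assumptions: an in-memory string stands in for a real export such as `df_usa_seed.csv`, and the four columns shown are a hypothetical subset.

```python
import io
import pandas as pd

# Hypothetical stand-in for one exported file; a real call would pass a file path instead.
csv_text = (
    "org_uuid,date_seed,num_funding_rounds,status\n"
    "abc,2021-05-01,3,operating\n"
    "def,,,closed\n"
)

reloaded = pd.read_csv(
    io.StringIO(csv_text),
    parse_dates=["date_seed"],          # restore datetime columns
    dtype={
        "org_uuid": "string",
        "num_funding_rounds": "Int64",  # nullable integer tolerates empty cells
        "status": "category",
    },
)

print(reloaded.dtypes)
```

If round-tripping dtypes matters for the downstream analysis, a typed format such as Parquet is a common alternative, although the notebook itself exports CSV only.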


In [41]:
# Purpose: Export all datasets in `dfs` to CSV files named after their dictionary keys.

import pandas as pd
from pathlib import Path

# ==========================================
# EXPORT TO CSV
# ==========================================

export_dir = Path(".")
print(f"Exporting {len(dfs)} datasets to {export_dir.absolute()} ...\n")

for name, df in dfs.items():
    # Construct filename: df_usa_seed -> df_usa_seed.csv
    filename = f"{name}.csv"
    csv_path = export_dir / filename
    
    # Export
    df.to_csv(csv_path, index=False)
    
    print(f"Wrote {len(df):,} rows to {filename}")

print("\nAll exports complete.")

Exporting 20 datasets to /Users/stefan/Desktop/Thesis/v4/Study/constructing features ...

Wrote 9,442 rows to df_global_pre_seed.csv
Wrote 59,611 rows to df_global_seed.csv
Wrote 28,111 rows to df_global_series_a.csv
Wrote 13,255 rows to df_global_seed_to_series_a.csv
Wrote 15,800 rows to df_global_angel.csv
Wrote 606 rows to df_uk_pre_seed.csv
Wrote 4,199 rows to df_uk_seed.csv
Wrote 1,233 rows to df_uk_series_a.csv
Wrote 795 rows to df_uk_seed_to_series_a.csv
Wrote 723 rows to df_uk_angel.csv
Wrote 671 rows to df_uk_cb_only_pre_seed.csv
Wrote 4,504 rows to df_uk_cb_only_seed.csv
Wrote 1,329 rows to df_uk_cb_only_series_a.csv
Wrote 881 rows to df_uk_cb_only_seed_to_series_a.csv
Wrote 796 rows to df_uk_cb_only_angel.csv
Wrote 3,471 rows to df_usa_pre_seed.csv
Wrote 26,277 rows to df_usa_seed.csv
Wrote 11,670 rows to df_usa_series_a.csv
Wrote 6,495 rows to df_usa_seed_to_series_a.csv
Wrote 3,573 rows to df_usa_angel.csv

All exports complete.


## Diagnostics

Quick QA / sanity checks to understand the resulting samples:
- Counts of IPO / acquired / closed statuses.
- Covariate coverage (missingness) and positivity checks (sparsity / zero cells) by status.
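
As a minimal sketch of the positivity check, the snippet below cross-tabulates a covariate against `status` and lists the levels that have a zero count for at least one status. The frame `toy` and its values are hypothetical and exist only to show what a flagged level looks like.

```python
import pandas as pd

# Hypothetical toy data: the "cvc" level never co-occurs with status "ipo".
toy = pd.DataFrame({
    "investor_type": ["vc", "vc", "angel", "angel", "cvc"],
    "status": ["ipo", "closed", "ipo", "closed", "closed"],
})

# Rows are covariate levels, columns are status values.
ct = pd.crosstab(toy["investor_type"], toy["status"])

# A level is flagged when any status column has a zero count for it.
zero_levels = ct.index[ct.eq(0).any(axis=1)].tolist()

print(ct)
print(zero_levels)
```

A flagged level means no observation with that covariate value has one of the outcomes, which can make stratified or model-based estimates for that cell unreliable.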


In [42]:
# Purpose: Diagnostic counts of IPO/acquired/closed statuses per subset.

import pandas as pd

status_counts = {}

print("Counting IPO and Acquired statuses...")

# Iterate through the 'dfs' dictionary
for name, df in dfs.items():
    if "status" not in df.columns:
        continue

    # Normalize status to ensure we catch 'IPO', 'ipo', 'Acquired', 'acquired '
    status_series = df["status"].astype(str).str.strip().str.lower()
    counts = status_series.value_counts()
    
    status_counts[name] = {
        "ipo": int(counts.get("ipo", 0)),
        "acquired": int(counts.get("acquired", 0)),
        "closed": int(counts.get("closed", 0)), # Added for context
        "total": len(df)
    }

if status_counts:
    print("\nIPO and Acquired counts per DataFrame:")
    # Sort by name for cleaner output
    for name, counts in sorted(status_counts.items()):
        print(f" - {name:<30}: IPO={counts['ipo']:<5,} | Acquired={counts['acquired']:<6,} | Closed={counts['closed']:<6,} (Total: {counts['total']:,})")
else:
    print("No DataFrames with a status column were found for counting.")

Counting IPO and Acquired statuses...

IPO and Acquired counts per DataFrame:
 - df_global_angel               : IPO=367   | Acquired=1,439  | Closed=1,072  (Total: 15,800)
 - df_global_pre_seed            : IPO=53    | Acquired=1,063  | Closed=514    (Total: 9,442)
 - df_global_seed                : IPO=525   | Acquired=8,812  | Closed=5,913  (Total: 59,611)
 - df_global_seed_to_series_a    : IPO=256   | Acquired=3,107  | Closed=443    (Total: 13,255)
 - df_global_series_a            : IPO=931   | Acquired=6,006  | Closed=1,147  (Total: 28,111)
 - df_uk_angel                   : IPO=2     | Acquired=101    | Closed=159    (Total: 723)
 - df_uk_cb_only_angel           : IPO=4     | Acquired=112    | Closed=114    (Total: 796)
 - df_uk_cb_only_pre_seed        : IPO=2     | Acquired=65     | Closed=55     (Total: 671)
 - df_uk_cb_only_seed            : IPO=29    | Acquired=553    | Closed=587    (Total: 4,504)
 - df_uk_cb_only_seed_to_series_a: IPO=11    | Acquired=182    | Closed=26    

In [43]:
# Purpose: Diagnostics for modeling assumptions: covariate coverage and positivity (no zero cells) checks.

import pandas as pd

# ==========================================
# 1. SETUP
# ==========================================

covariates = [
    "ttf_percentile_binned",
    "founding_cohort",
    "sector",
    "prior_founding_experience",
    "founder_education",
    "founder_uni_reputation",
    "founding_team_size",
    "founding_team_diversity",
    "investor_type",
    "top_investor",
    "vc_cap_raised", 
    "time_to_1_mil_percentile_year_binned"
]

def prep(df):
    df = df.copy()
    # Normalize naming if needed
    if "founders_count" in df.columns and "founding_team_size" not in df.columns:
        df = df.rename(columns={"founders_count": "founding_team_size"})
    return df

def coverage_and_positivity(df, name):
    df = prep(df)
    print(f"\n{'='*10} {name} overall coverage (n={len(df)}) {'='*10}")
    
    # --- 1. Coverage ---
    rows, present_covs = [], []
    for cov in covariates:
        if cov not in df.columns:
            rows.append((cov, "absent", "absent", "absent", "absent"))
            continue
            
        non_empty = df[cov].notna().sum()
        rows.append(
            (cov, non_empty, len(df) - non_empty,
             f"{non_empty/len(df):.2%}", f"{(len(df)-non_empty)/len(df):.2%}")
        )
        present_covs.append(cov)
        
    print(pd.DataFrame(rows, columns=["covariate", "non_empty", "missing", "non_empty_pct", "missing_pct"]))

    if "status" not in df.columns:
        print("No 'status' column; skipping stratified checks.")
        return

    # --- 2. Positivity (Stratified Counts) ---
    print(f"\n--- Positivity Checks ({name}) ---")
    
    for cov in present_covs:
        # Check for binned version if calculating pure categorical positivity
        col = f"{cov}_bin" if f"{cov}_bin" in df.columns else cov
        
        # Create Crosstab (Covariate vs Status)
        ct = pd.crosstab(df[col], df["status"], dropna=False)
        
        # Check for Zero Cells (Perfect Separation / Positivity Violation)
        zero_levels = ct.index[ct.eq(0).any(axis=1)].tolist()
        
        print(f"\n> {cov} × status:")
        
        # Truncate output for very long lists (like Cohort Years)
        if len(ct) > 12:
            print(ct.iloc[:5].to_string())
            print(f"  ... [ {len(ct)-10} rows hidden ] ...")
            print(ct.iloc[-5:].to_string())
        else:
            print(ct.to_string())
            
        if zero_levels:
            print(f"  [WARNING] Zero counts detected in levels: {zero_levels[:5]} (Total: {len(zero_levels)})")

# ==========================================
# 2. EXECUTION LOOP
# ==========================================

for name, df in dfs.items():
    coverage_and_positivity(df, name)


                               covariate non_empty missing non_empty_pct  \
0                  ttf_percentile_binned      9442       0       100.00%   
1                        founding_cohort      9442       0       100.00%   
2                                 sector      9442       0       100.00%   
3              prior_founding_experience      9442       0       100.00%   
4                      founder_education      9442       0       100.00%   
5                 founder_uni_reputation      9438       4        99.96%   
6                     founding_team_size      7086    2356        75.05%   
7                founding_team_diversity      6745    2697        71.44%   
8                          investor_type      9442       0       100.00%   
9                           top_investor      9442       0       100.00%   
10                         vc_cap_raised      9442       0       100.00%   
11  time_to_1_mil_percentile_year_binned    absent  absent        absent   

   missing