# Crunchbase Study Dataset: Pre-processing

This notebook takes a merged Crunchbase export (one row per organization) and produces cleaned, analysis-ready CSVs for the thesis study.

**Inputs**
- `merged_big_30_11.csv` (merged Crunchbase dataset)

**Outputs**
- `global_companies.csv` (all retained orgs)
- `uk_companies.csv` (subset where `org_country == 'GBR'`)
- `usa_companies.csv` (subset where `org_country == 'USA'`)

**What happens in this pipeline (high-level)**
1. Load the merged Crunchbase file and de-duplicate by `org_uuid`.
2. Apply row-level cleaning rules (status cleanup, optional backfilling, dropping known-bad rows).
3. Remove impossible timelines (e.g., `closed_on < founded_on`).
4. Enforce a study *data freeze* (`freeze_date`) and limit companies to a founded-year window.
5. Keep only the columns needed downstream.
6. Parse the raw funding-round strings into stage-specific date/amount/UUID columns.
7. Add a coarse `region` variable from ISO country codes.
8. Export global + UK + US subsets.

Notes:
- The notebook uses **absolute paths** for input/output; adjust them if you move the project directory.
- Several steps are controlled via `STUDY_PARAMS['preprocessing']` for reproducibility.


In [1]:
# Imports used across the notebook.
# - pandas: core DataFrame wrangling
# - pycountry_convert: ISO country → continent mapping for a coarse `region` feature

from pathlib import Path

import pandas as pd
import pycountry_convert as pc


# Setting Parameters

This section defines the parameters that govern the preprocessing logic:
- `STUDY_PARAMS['preprocessing']`: toggles and cutoffs (e.g., date freeze, founded-year window).
- File paths for the merged input dataset and the exported outputs.

Keeping these values centralized makes the preprocessing reproducible and easier to audit.


In [2]:
# Study configuration (single source of truth).
# Adjust values here to re-run the pipeline with different cutoffs/toggles.
#
# Key parameters:
# - backfill_closed_on: if True, imputes missing `closed_on` for closed companies.
# - founded_year_range: defines the study cohort window.
# - freeze_date: caps event dates to prevent leakage beyond the cut-off.

# Centralized study parameters for easy tweaking
STUDY_PARAMS = {
    "preprocessing": {
        "backfill_closed_on": False,  # set to True to impute missing closed_on
        "drop_primary_role_investor": True,
        "founded_year_range": (2007, 2017),
        # "founded_year_range": (2007, 2024),
        "freeze_date": pd.Timestamp("2024-12-31"),
        "stale_active_cutoff": pd.Timestamp("2021-01-01"),  # cutoff for the stale-active flag (flags 0 rows in the current run)
    }
}

In [3]:
# Input/output locations.
# Using absolute paths keeps this notebook unambiguous on the author machine;
# if you move the repo, update these paths accordingly.

data_path = Path("/Users/stefan/Desktop/Thesis/v4/Crunchbase Data/Merging CB Datasets/merged_big_30_11.csv")
output_path_global = "/Users/stefan/Desktop/Thesis/v4/Study/cb data pre-processing/global_companies.csv"
output_path_uk = "/Users/stefan/Desktop/Thesis/v4/Study/cb data pre-processing/uk_companies.csv"
output_path_usa =  "/Users/stefan/Desktop/Thesis/v4/Study/cb data pre-processing/usa_companies.csv"

# Filtering Data

Load the merged Crunchbase dataset and apply *row-level* cleaning rules. The goal is to remove duplicates and drop records that are clearly inconsistent (e.g., missing required identifiers or impossible event timelines).

Most filtering steps preserve the original columns; structural reshaping happens later (funding-round parsing).


In [4]:
# Load the merged Crunchbase dataset.
# `convert_dtypes()` gives pandas' best-guess nullable dtypes for cleaner downstream handling.

df_crunchbase = pd.read_csv(data_path, low_memory=False).convert_dtypes()

### Removing Duplicate Orgs

Crunchbase exports can contain duplicate rows for the same organization. We treat `org_uuid` as the unique identifier and keep the first occurrence.
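
As a minimal sketch on a toy frame (the UUIDs and names below are made up), rows without an identifier and rows with a repeated identifier can be counted separately before both are removed. This is illustrative only and is not part of the pipeline itself:

```python
import pandas as pd

# Toy data: hypothetical UUIDs, one repeated and one missing.
toy = pd.DataFrame({
    "org_uuid": ["a1", "a1", "b2", None, "c3"],
    "org_name": ["Acme", "Acme (dup)", "Beta", "No ID", "Gamma"],
})

# Count the two removal reasons separately.
n_missing_id = int(toy["org_uuid"].isna().sum())
with_id = toy.dropna(subset=["org_uuid"])
n_duplicates = int(with_id.duplicated("org_uuid", keep="first").sum())

# Keep the first occurrence of each identifier.
deduped = with_id.drop_duplicates("org_uuid", keep="first")

print(f"missing org_uuid: {n_missing_id}, duplicate org_uuid: {n_duplicates}")
print(deduped["org_name"].tolist())
```

Reporting the two counts separately makes it easier to tell identifier gaps apart from true duplicates.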


In [5]:
# De-duplicate organizations by `org_uuid` (unique org identifier).
# Rows with a missing `org_uuid` are dropped first, so the reported count
# includes those rows as well as true duplicates (the first occurrence is kept).

rows_before = len(df_crunchbase)
df_crunchbase = df_crunchbase.dropna(subset=["org_uuid"]).drop_duplicates("org_uuid")
removed = rows_before - len(df_crunchbase)
print(f"Removed {removed:,} duplicate org_uuid rows")


Removed 2,318 duplicate org_uuid rows


## Filter

From here on, the notebook applies a sequence of deterministic filters:
- Optional `closed_on` backfilling for closed companies.
- Status normalization/flagging (informational).
- Dropping rows based on study rules (e.g., removing investors).
- Removing exact duplicate rows.


In [6]:
# Read preprocessing config and create reusable masks.
# Also prepares a boolean flag column to track whether `closed_on` was backfilled.

preproc_cfg = STUDY_PARAMS["preprocessing"]

status_clean = df_crunchbase["status"].astype(str).str.strip()
closed_mask = status_clean.eq("closed")
closed_missing_before = closed_mask & df_crunchbase["closed_on"].isna()
total_closed = int(closed_mask.sum())

df_crunchbase["closed_on_backfilled"] = False


### Backfill missing `closed_on` dates (optional) and log coverage

Some companies are labeled as `closed` but have missing `closed_on`. If `backfill_closed_on` is enabled, the notebook:
1. Extracts the **last funding date** from the raw `funding_round_*` strings.
2. Adds a 24-month runway assumption.
3. Caps the imputed date at the study `freeze_date`.

The notebook always logs how many rows are affected; the backfill itself is controlled by `STUDY_PARAMS`.
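
The three steps above can be sketched on toy data as follows. The raw strings are hypothetical; the only assumption about their format is that the round date appears as `(YYYY-MM-DD)`, which is what the extraction regex targets:

```python
import pandas as pd

# Hypothetical raw round strings; the third company has no funding history.
toy = pd.DataFrame({
    "funding_round_1": ["seed (2015-03-01) 500000", "angel (2022-06-15)", None],
    "funding_round_2": ["series_a (2017-09-30) 4000000", "seed (2023-11-01) 900000", None],
})
freeze_date = pd.Timestamp("2024-12-31")

# 1. Extract the date from every round column and take the row-wise maximum.
dates = toy.apply(
    lambda col: pd.to_datetime(
        col.str.extract(r"\((\d{4}-\d{2}-\d{2})\)")[0], errors="coerce"
    )
)
last_funding = dates.max(axis=1)

# 2. Add the 24-month runway assumption.
imputed = last_funding + pd.DateOffset(months=24)

# 3. Cap at the freeze date; `mask` leaves missing values (NaT) untouched.
capped = imputed.mask(imputed > freeze_date, freeze_date)
print(capped)
```

In this sketch the first company gets 2019-09-30, the second is capped at the freeze date, and the third stays missing because no funding date could be parsed.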


In [7]:
# Optional: backfill missing `closed_on` for closed companies.
# Strategy (if enabled): last funding date + 24 months, capped at freeze_date.

import re

# Ensure `closed_on` is datetime before any backfilled values are assigned.
df_crunchbase["closed_on"] = pd.to_datetime(df_crunchbase["closed_on"], errors="coerce")

missing_closed_on_count = int(closed_missing_before.sum())
filled_closed_on_count = 0

if preproc_cfg["backfill_closed_on"]:
    print("Extracting last funding date from round details...")
    
    # 1. Select the rows we need to fix
    # We work on a copy to avoid SettingWithCopy warnings
    target_rows = df_crunchbase.loc[closed_missing_before].copy()
    
    # 2. Define the columns to check. Only funding_round_1..funding_round_15 are
    #    scanned, although the export carries columns up to funding_round_43.
    round_cols = [f"funding_round_{i}" for i in range(1, 16)]
    
    # 3. Extract dates from each column
    extracted_dates = pd.DataFrame(index=target_rows.index)
    
    for col in round_cols:
        if col in target_rows.columns:
            # Extract the date part
            dates_str = target_rows[col].astype(str).str.extract(r'\((\d{4}-\d{2}-\d{2})\)')[0]
            extracted_dates[col] = pd.to_datetime(dates_str, errors='coerce')
            
    # 4. Find the MAX date across all 15 columns for each company
    last_funding_computed = extracted_dates.max(axis=1)
    
    # 5. Apply the "Runway" Assumption (Last Funding + 24 Months)
    imputed_closure = last_funding_computed + pd.DateOffset(months=24)
    
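    # 6. Cap at Freeze Date
    # `mask` leaves NaT untouched; `where(imputed <= freeze_date, ...)` would replace
    # rows with no parsable funding date by `freeze_date` instead of skipping them.
    freeze_date = preproc_cfg["freeze_date"]
    capped_backfill = imputed_closure.mask(
        imputed_closure > freeze_date,
        freeze_date,
    )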
    
    # 7. Apply the backfill
    # Only fill where we successfully found a funding date
    valid_fill_mask = capped_backfill.notna()
    
    # Update the main DataFrame
    # Align indices to ensure correct assignment
    # We use reindex to align the boolean mask with the main dataframe's index
    indexer = closed_missing_before & valid_fill_mask.reindex(df_crunchbase.index, fill_value=False)
    
    df_crunchbase.loc[indexer, "closed_on"] = capped_backfill
    df_crunchbase.loc[indexer, "closed_on_backfilled"] = True
    
    filled_closed_on_count = int(valid_fill_mask.sum())

percent_backfilled = (filled_closed_on_count / total_closed * 100.0) if total_closed else 0.0
print(
    f"Closed companies total: {total_closed:,}\n"
    f"Closed companies missing closed_on: {missing_closed_on_count:,}\n"
    f"Closed companies missing closed_on (backfilled): {filled_closed_on_count:,}\n"
    f"Share of closed companies backfilled: {percent_backfilled:.2f}%"
)

Closed companies total: 176,004
Closed companies missing closed_on: 112,350
Closed companies missing closed_on (backfilled): 0
Share of closed companies backfilled: 0.00%


### Flag stale “active” companies and acquired records (informational)

This step updates the `status` label for two cases:
- Organizations still marked as `active` but not updated since `stale_active_cutoff`.
- Organizations still marked as `active` that have a non-empty `acquired_by` field (relabeled as `acquired`).

This is meant for downstream interpretability; it does **not** drop rows.


In [8]:
# Informational status relabeling.
# - Marks stale 'active' orgs (no update since cutoff)
# - Marks 'active' orgs with an acquirer as 'acquired'

status_lower = status_clean.str.lower()
org_updated_ts = pd.to_datetime(df_crunchbase["org_updated_at"], errors="coerce")
stale_active_cutoff = preproc_cfg["stale_active_cutoff"]

active_mask = status_lower.eq("active")
stale_active_mask = active_mask & (org_updated_ts < stale_active_cutoff)
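# fillna("") first: on nullable string columns, astype(str) can turn <NA> into the literal "<NA>".
acquired_flag_mask = active_mask & df_crunchbase["acquired_by"].fillna("").astype(str).str.strip().ne("")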

df_crunchbase.loc[stale_active_mask, "status"] = "listed as active but no update in 5 years"
df_crunchbase.loc[acquired_flag_mask, "status"] = "acquired"

print(f"Flagged stale actives (no update since {stale_active_cutoff.date()}): {int(stale_active_mask.sum()):,}")
print(f"Flagged actives with acquisitions as acquired: {int(acquired_flag_mask.sum()):,}")


Flagged stale actives (no update since 2021-01-01): 0
Flagged actives with acquisitions as acquired: 0


### Drop unwanted rows and duplicates

Apply the main study exclusion rules:
- Optionally remove rows where `primary_role == 'investor'`.
- Drop rows where an acquisition is indicated (`acquired_by` present) but no acquisition date is available.
- Remove exact duplicate rows after filtering.

The notebook prints a breakdown of removals for auditability.
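
A minimal sketch of these rules on toy data (all values are made up): each rule becomes a boolean mask, the masks are combined with `|`, and exact duplicates are removed afterwards.

```python
import pandas as pd

toy = pd.DataFrame({
    "primary_role": ["company", "investor", "company", "company"],
    "acquired_by": [None, None, "BigCo", "BigCo"],
    "acquired_on_first": [None, None, "2019-05-01", None],
})
# Append an exact duplicate of row 0 to exercise the duplicate rule.
toy = pd.concat([toy, toy.iloc[[0]]], ignore_index=True)

# Rule 1: investor records.
is_investor = toy["primary_role"].eq("investor")

# Rule 2: an acquirer is named but no acquisition date exists.
has_acquirer = toy["acquired_by"].fillna("").astype(str).str.strip().ne("")
acquired_without_date = has_acquirer & toy["acquired_on_first"].isna()

# Union of the rules, then exact-duplicate removal.
drop_mask = is_investor | acquired_without_date
kept = toy.loc[~drop_mask].drop_duplicates(keep="first")

print(f"dropped by rules: {int(drop_mask.sum())}, remaining: {len(kept)}")
```

Because the masks are combined with a logical OR, a row that violates several rules is still counted once in the total.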


In [9]:
# Build exclusion masks used to drop rows:
# - investor-only records (optional)
# - acquired_by present but missing acquisition date
# - closed but missing closed_on (tracked for reporting)

primary_role = df_crunchbase["primary_role"].astype(str).str.strip()
mask_primary_role_investor = primary_role.eq("investor")

has_acquirer = (
    df_crunchbase["acquired_by"]
    .fillna("")        # keep real nulls empty
    .astype(str)
    .str.strip()
)
mask_acquired_mismatch = has_acquirer.ne("") & df_crunchbase["acquired_on_first"].isna()


mask_closed_missing_date = status_clean.eq("closed") & df_crunchbase["closed_on"].isna()


In [10]:
# Apply the exclusion rules and remove exact duplicate rows post-filtering.

drop_mask = mask_acquired_mismatch.copy()
if preproc_cfg["drop_primary_role_investor"]:
    drop_mask |= mask_primary_role_investor

dropped_by_rules = int(drop_mask.sum())
df_filtered = df_crunchbase.loc[~drop_mask].copy()

duplicate_mask = df_filtered.duplicated(keep="first")
duplicates_dropped = int(duplicate_mask.sum())
if duplicates_dropped:
    df_filtered = df_filtered.loc[~duplicate_mask].copy()


In [11]:
# Audit printout: row counts per rule.
# The "closed_on missing" and "closed_on backfilled" entries are informational;
# only the investor, acquisition-mismatch, and duplicate rules remove rows.

counts = [
    ("acquired_by present but acquired_on_first missing", int(mask_acquired_mismatch.sum())),
    ("status == 'closed' but closed_on missing", int(mask_closed_missing_date.sum())),
    ("closed_on backfilled from last funding date + 24 months", filled_closed_on_count),
    ("exact duplicate rows", duplicates_dropped),
]
if preproc_cfg["drop_primary_role_investor"]:
    counts.insert(0, ("primary_role == 'investor'", int(mask_primary_role_investor.sum())))

print("Rows filtered by reason:")
for reason, count in counts:
    print(f"  {reason}: {count:,}")

total_removed = dropped_by_rules + duplicates_dropped
print(f"Total rows removed (union reasons + duplicates): {total_removed:,}")
print(f"Remaining rows: {len(df_filtered):,}")


Rows filtered by reason:
  primary_role == 'investor': 104,607
  acquired_by present but acquired_on_first missing: 0
  status == 'closed' but closed_on missing: 112,350
  closed_on backfilled from last funding date + 24 months: 0
  exact duplicate rows: 0
Total rows removed (union reasons + duplicates): 104,607
Remaining rows: 3,887,655


### Filter out negative event durations

Remove organizations with impossible timelines:
- `closed_on < founded_on`
- `went_public_on < founded_on`
- `acquired_on_first < founded_on`

The notebook reports counts per failure type and then drops all flagged rows.
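
As an illustration on made-up rows, the `closed_on < founded_on` check reduces to a single boolean mask. Rows where either date is missing are not flagged:

```python
import pandas as pd

toy = pd.DataFrame({
    "org_uuid": ["a1", "b2", "c3"],
    "founded_on": ["2010-01-01", "2012-06-01", None],
    "closed_on": ["2015-01-01", "2011-01-01", "2014-01-01"],
})

founded = pd.to_datetime(toy["founded_on"], errors="coerce")
closed = pd.to_datetime(toy["closed_on"], errors="coerce")

# Flag only rows where both dates exist and the closure precedes the founding.
negative = founded.notna() & closed.notna() & (closed < founded)
print(toy.loc[negative, "org_uuid"].tolist())
```

Only the second row is flagged here; the third has no founding date, so no duration can be computed for it.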


In [12]:
# Helper: identify organizations with impossible event ordering (negative durations).
# Used for reporting; the next cell performs the actual dropping.

def find_negative_durations(df: pd.DataFrame) -> dict[str, pd.Index]:
    required = ["founded_on", "closed_on", "went_public_on", "acquired_on_first"]
    missing = [col for col in required if col not in df.columns]
    if missing:
        raise KeyError(f"Missing required columns: {missing}")

    founded = pd.to_datetime(df["founded_on"], errors="coerce")
    closed = pd.to_datetime(df["closed_on"], errors="coerce")
    went_public = pd.to_datetime(df["went_public_on"], errors="coerce")
    acquired = pd.to_datetime(df["acquired_on_first"], errors="coerce")

    masks = {
        "time_to_close": founded.notna() & closed.notna() & (closed < founded),
        "time_to_ipo": founded.notna() & went_public.notna() & (went_public < founded),
        "time_to_acquisition": founded.notna() & acquired.notna() & (acquired < founded),
    }

    negatives = {
        label: (df.loc[mask, "org_uuid"] if "org_uuid" in df.columns else df.index[mask])
        for label, mask in masks.items()
    }

    for label, ids in negatives.items():
        print(f"{label}: {len(ids)} negative durations")

    return negatives

negative_ids = find_negative_durations(df_filtered)


time_to_close: 44 negative durations
time_to_ipo: 676 negative durations
time_to_acquisition: 584 negative durations


In [13]:
# Drop all organizations with negative durations across close/IPO/acquisition timelines.

def drop_negative_durations(df: pd.DataFrame) -> tuple[pd.DataFrame, dict[str, pd.Index]]:
    required = ["founded_on", "closed_on", "went_public_on", "acquired_on_first"]
    missing = [col for col in required if col not in df.columns]
    if missing:
        raise KeyError(f"Missing required columns: {missing}")

    founded = pd.to_datetime(df["founded_on"], errors="coerce")
    closed = pd.to_datetime(df["closed_on"], errors="coerce")
    went_public = pd.to_datetime(df["went_public_on"], errors="coerce")
    acquired = pd.to_datetime(df["acquired_on_first"], errors="coerce")

    masks = {
        "time_to_close": founded.notna() & closed.notna() & (closed < founded),
        "time_to_ipo": founded.notna() & went_public.notna() & (went_public < founded),
        "time_to_acquisition": founded.notna() & acquired.notna() & (acquired < founded),
    }

    negatives = {
        label: (df.loc[mask, "org_uuid"] if "org_uuid" in df.columns else df.index[mask])
        for label, mask in masks.items()
    }

    combined_mask = masks["time_to_close"] | masks["time_to_ipo"] | masks["time_to_acquisition"]
    cleaned = df.loc[~combined_mask].copy()

    for label, ids in negatives.items():
        print(f"{label}: {len(ids)} negative durations")
    print(f"Kept rows: {len(cleaned):,} / {len(df):,}")

    return cleaned, negatives

df_filtered, negative_ids = drop_negative_durations(df_filtered)


time_to_close: 44 negative durations
time_to_ipo: 676 negative durations
time_to_acquisition: 584 negative durations
Kept rows: 3,886,357 / 3,887,655


# Enforcing Data Freeze and Cut-off

Two study-wide constraints are applied here:

1. **Founded-year window**: keep only organizations founded within the configured year range.
2. **Data freeze**: cap all relevant event dates at `freeze_date` so that downstream analyses do not leak information past the study cut-off.
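
Both constraints can be sketched on toy data as follows. The window and freeze date mirror the configured values, while the rows themselves are hypothetical:

```python
import pandas as pd

toy = pd.DataFrame({
    "founded_on": ["2006-12-31", "2010-05-01", "2017-12-31", "2018-01-01"],
    "closed_on": [None, "2025-03-01", "2020-01-01", None],
})
year_min, year_max = 2007, 2017
freeze_date = pd.Timestamp("2024-12-31")

# 1. Founded-year window (inclusive on both ends).
founded = pd.to_datetime(toy["founded_on"], errors="coerce")
cohort = toy.loc[founded.dt.year.between(year_min, year_max, inclusive="both")].copy()

# 2. Data freeze: any date after the cut-off is replaced by the cut-off itself.
closed = pd.to_datetime(cohort["closed_on"], errors="coerce")
cohort["closed_on"] = closed.mask(closed > freeze_date, freeze_date)
print(cohort)
```

Capping rewrites a later date to the freeze date rather than removing it, so any downstream step that needs to treat such events as unobserved has to handle that separately.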


In [14]:
# Apply the founded-year cohort filter and enforce the study freeze date.
# Any event timestamp later than `freeze_date` is capped to `freeze_date`.

founded_ts = pd.to_datetime(df_filtered["founded_on"], errors="coerce")
year_min, year_max = preproc_cfg["founded_year_range"]
df_filtered2 = df_filtered.loc[founded_ts.dt.year.between(year_min, year_max, inclusive="both")].copy()

cutoff = preproc_cfg["freeze_date"]
date_cols = [
    "created_at",
    "last_funding_on",
    "closed_on",
    "first_funding_date",
    "acquired_on_first",
    "acquired_on_last",
    "first_acquired_company_on",
    "last_acquired_company_on",
    "ipo_went_public_on",
    "went_public_on",
    "org_created_at",
    "org_updated_at",
]

for col in df_filtered2.columns.intersection(date_cols):
    ts = pd.to_datetime(df_filtered2[col], errors="coerce")
    df_filtered2[col] = ts.mask(ts > cutoff, cutoff)

df_filtered2.info()


<class 'pandas.core.frame.DataFrame'>
Index: 1140666 entries, 15 to 3994570
Columns: 119 entries, org_uuid to closed_on_backfilled
dtypes: Float64(2), Int64(18), bool(1), datetime64[ns](12), string(86)
memory usage: 1.0 GB


# Filtering down to useful columns

At this point the dataset is reduced to the columns required by the downstream feature construction and modeling steps.

This includes identifiers, status/outcome dates, founder attributes, and the raw `funding_round_*` strings used for later parsing.
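
One possible way to keep such a selection compact is to generate the `funding_round_*` names programmatically and to check for missing columns before selecting. The sketch below uses a shortened, illustrative base list and an empty toy frame; `base_cols` and `unused_column` are hypothetical names:

```python
import pandas as pd

# Shortened base list for illustration; the real selection has many more columns.
base_cols = ["org_uuid", "status", "founded_on"]
round_cols = [f"funding_round_{i}" for i in range(1, 44)]  # funding_round_1 .. funding_round_43
wanted = base_cols + round_cols

# Empty toy frame with one extra column that should not survive the selection.
toy = pd.DataFrame(columns=wanted + ["unused_column"])

# Fail early with a readable message if an expected column is absent.
missing = [c for c in wanted if c not in toy.columns]
if missing:
    raise KeyError(f"Missing expected columns: {missing}")

selected = toy.loc[:, wanted].copy()
print(selected.shape)
```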


In [15]:
# Select the final column set used for the thesis study dataset.
# This keeps identifiers/outcomes/founder variables plus raw funding_round strings for later parsing.

df_study = df_filtered2.loc[:, [
    "org_uuid",
    "org_name",
    "legal_name",
    "homepage_url",
    "status",
    "org_country",
    "org_city",
    "founded_on",
    "first_funding_date",
    "date_of_1_million",
    "weighted_time",
    "funding_25pct_date",
    "funding_25pct_round_number",
    "first_round_size_to_total_funding",
    "first_funding_investor_type",
    "first_round_investor_uuid",
    "first_funding_leads",
    "founders_has_phd",
    "founders_has_mba",
    "founders_has_masters",
    "founders_has_bachelors",
    "founders_has_jd",
    "founders_degrees",
    "founders_count",
    "founders_countries",
    "founders_female_count",
    "founders_male_count",
    "founders_descriptions",
    "parent_uuid",
    "employee_count",
    "closed_on",
    "went_public_on",
    "acquired_on_first",
    "first_funding_raised_usd",
    "total_funding_usd",
    "first_funding_post_money_usd",
    "num_funding_rounds",
    "category_list",
    "category_groups_list",

    "funding_round_1",
    "funding_round_2",
    "funding_round_3",
    "funding_round_4",
    "funding_round_5",
    "funding_round_6",
    "funding_round_7",
    "funding_round_8",
    "funding_round_9",
    "funding_round_10",
    "funding_round_11",
    "funding_round_12",
    "funding_round_13",
    "funding_round_14",
    "funding_round_15",
    "funding_round_16",
    "funding_round_17",
    "funding_round_18",
    "funding_round_19",
    "funding_round_20",
    "funding_round_21",
    "funding_round_22",
    "funding_round_23",
    "funding_round_24",
    "funding_round_25",
    "funding_round_26",
    "funding_round_27",
    "funding_round_28",
    "funding_round_29",
    "funding_round_30",
    "funding_round_31",
    "funding_round_32",
    "funding_round_33",
    "funding_round_34",
    "funding_round_35",
    "funding_round_36",
    "funding_round_37",
    "funding_round_38",
    "funding_round_39",
    "funding_round_40",
    "funding_round_41",
    "funding_round_42",
    "funding_round_43"
]].copy()

# Parsing and validating funding-round histories

Crunchbase funding rounds arrive as semi-structured strings (type/date/amount/investor UUIDs). This section:
- Parses raw `funding_round_*` columns into a compact, chronological sequence.
- Projects that sequence into stage-specific columns (e.g., `date_seed`, `amount_series_a`, `uuids_angel`).
- Drops companies with impossible stage ordering (chronology sanity checks).

The result is easier to use in event-time analyses (e.g., “time from seed to Series A”).
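
To make the parsing step concrete, here is a small standalone sketch that applies the same regular expression to a single string. The example strings and the `parse_round` helper are hypothetical; the real raw format may differ in details:

```python
import re

# Same pattern as the notebook: "<type> (<date>) <amount> <uuids>".
PATTERN = re.compile(r"^(.*?)\s*\(([\d-]{4,10})\)\s*(.*)$")

def parse_round(raw):
    """Split one raw round string into type, date, amount, and investor UUIDs."""
    match = PATTERN.match(raw.strip())
    if match is None:
        return None
    round_type = match.group(1).strip().lower()
    date = match.group(2)
    remainder = match.group(3).strip()
    amount, uuids = None, None
    if remainder:
        head, _, tail = remainder.partition(" ")
        # Heuristic: a first token containing digits is treated as the amount.
        if any(ch.isdigit() for ch in head):
            amount, uuids = head, (tail or None)
        else:
            uuids = remainder
    return {"type": round_type, "date": date, "amount": amount, "uuids": uuids}

example = parse_round("Series_A (2016-04-12) 5000000 inv-1,inv-2")
no_amount = parse_round("seed (2014-01-05)")
unparsable = parse_round("no date here")
print(example)
```

Note the limitation of the digit heuristic: if the amount is missing and the first UUID token contains digits, that token would be read as the amount.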


In [16]:
# Funding-round parsing and validation.
# Converts semi-structured `funding_round_*` strings into:
# - a compact chronological sequence (logical rounds)
# - stage-specific columns like `date_seed`, `amount_series_a`, `uuids_angel`

import pandas as pd
import numpy as np
import re

# ==========================================
# 1. SETUP & DEFINITIONS
# ==========================================

ALLOWED_TYPES = {
    'pre_seed', 'seed', 'angel',
    'venture', 
    'series_a', 'series_b', 'series_c', 'series_d', 'series_e', 
    'series_f', 'series_g', 'series_h', 'series_i', 'series_j'
}

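# Mapping types to destination prefixes.
# Series D through H are folded into the `series_c` bucket (i.e. "Series C or later").
# Types in ALLOWED_TYPES without an entry here ('venture', 'series_i', 'series_j')
# are parsed into the logical sequence but never projected to a stage column.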
TYPE_TO_DESTINATION_MAP = {
    'pre_seed': 'pre_seed',
    'angel':    'angel',
    'seed':     'seed',
    'series_a': 'series_a',
    'series_b': 'series_b',
    'series_c': 'series_c',
    'series_d': 'series_c',
    'series_e': 'series_c',
    'series_f': 'series_c',
    'series_g': 'series_c',
    'series_h': 'series_c',
}

PARSING_PATTERN = re.compile(r"^(.*?)\s*\(([\d-]{4,10})\)\s*(.*)$")

# ==========================================
# 2. PARSING LOGIC
# ==========================================

def get_compact_sequence(row, input_cols, max_depth=10):
    parsed_rounds = []
    seen_types = set()
    
    for col in input_cols:
        val = row[col]
        if pd.isna(val) or val == "":
            continue
        
        s_val = str(val).strip()
        match = PARSING_PATTERN.match(s_val)
        
        if not match:
            continue
            
        r_type = match.group(1).strip().lower()
        r_date = match.group(2)
        remainder = match.group(3).strip()
        
        r_amount = None
        r_uuids = None
        
        if remainder:
            parts = remainder.split(' ', 1)
            if any(char.isdigit() for char in parts[0]):
                r_amount = parts[0]
                r_uuids = parts[1] if len(parts) > 1 else None
            else:
                r_amount = None
                r_uuids = remainder
        
        if r_type in ALLOWED_TYPES:
            parsed_rounds.append({
                'type': r_type,
                'date': r_date,
                'amount': r_amount,
                'uuids': r_uuids
            })

    # Sort Chronologically
    parsed_rounds.sort(key=lambda x: x['date'])
    
    flat_data = []
    for r in parsed_rounds:
        if r['type'] not in seen_types:
            flat_data.extend([r['type'], r['date'], r['amount'], r['uuids']])
            seen_types.add(r['type'])
            if len(seen_types) >= max_depth:
                break
            
    # Padding
    rounds_found = len(seen_types)
    if rounds_found < max_depth:
        flat_data += [None, None, None, None] * (max_depth - rounds_found)
        
    return flat_data

# ==========================================
# 3. APPLY TO DATAFRAME
# ==========================================

raw_funding_cols = [c for c in df_study.columns if c.startswith("funding_round_")]
# Ensure chronological inspection (though logic sorts anyway)
raw_funding_cols.sort(key=lambda x: int(x.split('_')[-1]) if '_' in x and x.split('_')[-1].isdigit() else 0)

MAX_LOGICAL_STEPS = 10 
compact_col_names = []
for i in range(MAX_LOGICAL_STEPS):
    compact_col_names.extend([
        f"logical_round_{i+1}", 
        f"logical_round_{i+1}_date", 
        f"logical_round_{i+1}_amount",
        f"logical_round_{i+1}_uuids"
    ])

print("Parsing funding strings...")
compact_data = df_study.apply(
    lambda row: get_compact_sequence(row, raw_funding_cols, MAX_LOGICAL_STEPS), 
    axis=1, 
    result_type='expand'
)
compact_data.columns = compact_col_names

cols_to_drop = [c for c in df_study.columns if c.startswith("logical_round_") or c.startswith("funding_round_")]
df_study = pd.concat([df_study.drop(columns=cols_to_drop, errors='ignore'), compact_data], axis=1)

# ==========================================
# 4. CREATE STAGE-SPECIFIC COLUMNS
# ==========================================
print("Mapping logical rounds to stage specific columns...")

unique_stages = set(TYPE_TO_DESTINATION_MAP.values())
for stage in unique_stages:
    df_study[f"date_{stage}"] = pd.NaT
    df_study[f"amount_{stage}"] = None
    df_study[f"uuids_{stage}"] = None

for i in range(1, MAX_LOGICAL_STEPS + 1):
    log_type = f"logical_round_{i}"
    log_date = f"logical_round_{i}_date"
    log_amt  = f"logical_round_{i}_amount"
    log_uuid = f"logical_round_{i}_uuids"
    
    for src_type, target_suffix in TYPE_TO_DESTINATION_MAP.items():
        target_date_col = f"date_{target_suffix}"
        target_amt_col  = f"amount_{target_suffix}"
        target_uuid_col = f"uuids_{target_suffix}"
        
        mask = (df_study[log_type] == src_type) & (df_study[target_date_col].isna())
        
        if mask.any():
            df_study.loc[mask, target_date_col] = df_study.loc[mask, log_date]
            df_study.loc[mask, target_amt_col]  = df_study.loc[mask, log_amt]
            df_study.loc[mask, target_uuid_col] = df_study.loc[mask, log_uuid]

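# ==========================================
# 5. CONVERT STAGE DATES TO DATETIME
# ==========================================
date_cols = [f"date_{s}" for s in unique_stages]
for col in date_cols:
    df_study[col] = pd.to_datetime(df_study[col], errors='coerce')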

# ==========================================
# 6. CHRONOLOGY SANITY CHECK (ROBUST)
# ==========================================
print("Running comprehensive chronology sanity checks...")

# Define the strict chronological order
stage_order = [
    'date_pre_seed', 
    'date_seed', 
    'date_series_a', 
    'date_series_b', 
    'date_series_c'
]

df_study['chronology_valid'] = True

# Nested Loop: Compare EVERY stage against ALL SUBSEQUENT stages
# Example: Check Pre-Seed vs Seed, Pre-Seed vs Series A, Pre-Seed vs Series B...
for i in range(len(stage_order)):
    early_stage = stage_order[i]
    
    for j in range(i + 1, len(stage_order)):
        late_stage = stage_order[j]
        
        if early_stage not in df_study.columns or late_stage not in df_study.columns:
            continue
            
        # Error if: Both exist AND Early Date > Late Date
        mask_error = (
            df_study[early_stage].notna() & 
            df_study[late_stage].notna() & 
            (df_study[early_stage] > df_study[late_stage])
        )
        
        error_count = mask_error.sum()
        if error_count > 0:
            print(f"  ! Found {error_count} rows where {early_stage} is AFTER {late_stage}")
            df_study.loc[mask_error, 'chronology_valid'] = False

rows_to_drop = (~df_study['chronology_valid']).sum()

if rows_to_drop > 0:
    print(f"Removing {rows_to_drop} companies with impossible timelines...")
    df_study = df_study[df_study['chronology_valid']].copy()
else:
    print("  - No chronology errors found.")

# Drop the helper flag in both branches so it never reaches the exported CSVs.
df_study.drop(columns=['chronology_valid'], inplace=True)

# ==========================================
# 7. STAGE SUBSETS (IN-MEMORY, COUNTS ONLY)
# ==========================================
# Nothing is written to disk here; the subsets are built only to report their sizes.
# A company can appear in several subsets.

base_name = "study" 

# 1. Companies with a pre-seed round
df_pre_seed = df_study[df_study['date_pre_seed'].notna()].copy()
print(f"df_{base_name}_pre_seed: {len(df_pre_seed)}")

# 2. Companies with a seed round
df_seed = df_study[df_study['date_seed'].notna()].copy()
print(f"df_{base_name}_seed: {len(df_seed)}")

# 3. Companies with a Series A round
df_series_a = df_study[df_study['date_series_a'].notna()].copy()
print(f"df_{base_name}_series_a: {len(df_series_a)}")

# 4. Companies with both a seed and a Series A round
df_seed_to_a = df_study[df_study['date_seed'].notna() & df_study['date_series_a'].notna()].copy()
print(f"df_{base_name}_seed_to_series_a: {len(df_seed_to_a)}")

# 5. Companies with an angel round
df_angel = df_study[df_study['date_angel'].notna()].copy()
print(f"df_{base_name}_angel: {len(df_angel)}")

print("Processing complete.")

Parsing funding strings...
Mapping logical rounds to stage specific columns...
Running comprehensive chronology sanity checks...
  ! Found 667 rows where date_pre_seed is AFTER date_seed
  ! Found 63 rows where date_pre_seed is AFTER date_series_a
  ! Found 16 rows where date_pre_seed is AFTER date_series_b
  ! Found 9 rows where date_pre_seed is AFTER date_series_c
  ! Found 200 rows where date_seed is AFTER date_series_a
  ! Found 46 rows where date_seed is AFTER date_series_b
  ! Found 14 rows where date_seed is AFTER date_series_c
  ! Found 27 rows where date_series_a is AFTER date_series_b
  ! Found 13 rows where date_series_a is AFTER date_series_c
  ! Found 14 rows where date_series_b is AFTER date_series_c
Removing 948 companies with impossible timelines...
df_study_pre_seed: 13820
df_study_seed: 74722
df_study_series_a: 34446
df_study_seed_to_series_a: 14782
df_study_angel: 17446
Processing complete.


# Exporting Global, UK, and US subsets

After cleaning, the notebook adds a coarse `region` (continent-level) and exports:
- the full cleaned dataset (`global_companies.csv`)
- a UK-only subset (`org_country == 'GBR'`)
- a US-only subset (`org_country == 'USA'`)

It also prints basic `info()` summaries and counts of key statuses for quick sanity checks.


### Fixing Country Codes / Regions

Crunchbase country codes are stored as ISO **alpha-3** codes (e.g., `GBR`, `USA`). For some analyses it is useful to have a broader region label.

This cell maps ISO alpha-3 codes → ISO alpha-2 → continent code, using `pycountry_convert`, with a small override table for legacy codes.


In [17]:
# Create a continent-level `region` variable from ISO alpha-3 country codes.

_continent_names = {
    "AF": "Africa",
    "AN": "Antarctica",
    "AS": "Asia",
    "EU": "Europe",
    "NA": "North America",
    "OC": "Oceania",
    "SA": "South America",
}

_alpha3_overrides = {
    "ROM": "ROU",  # legacy ISO code for Romania
    "TAN": "TZA",  # legacy code for Tanzania
}

def alpha3_to_region(alpha3):
    if not isinstance(alpha3, str) or not alpha3:
        return pd.NA
    alpha3 = _alpha3_overrides.get(alpha3.upper(), alpha3.upper())
    try:
        alpha2 = pc.country_alpha3_to_country_alpha2(alpha3)
        continent_code = pc.country_alpha2_to_continent_code(alpha2)
    except (KeyError, ValueError):
        return "Other"
    return _continent_names.get(continent_code, "Other")

df_study["region"] = df_study["org_country"].map(alpha3_to_region).astype("category")

### Filtering UK

Create the UK subset (`org_country == 'GBR'`), export it, and run quick sanity checks.


In [18]:
# UK subset (ISO alpha-3: GBR).

uk_comps = df_study.loc[df_study["org_country"].eq("GBR")].copy()

In [19]:
# Persist the cleaned datasets to disk.

df_study.to_csv(output_path_global, index=False)
uk_comps.to_csv(output_path_uk, index=False)


In [20]:
# Quick schema/memory sanity check for the full (global) study dataset.
df_study.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1139718 entries, 15 to 3994570
Data columns (total 98 columns):
 #   Column                             Non-Null Count    Dtype         
---  ------                             --------------    -----         
 0   org_uuid                           1139718 non-null  string        
 1   org_name                           1139711 non-null  string        
 2   legal_name                         624300 non-null   string        
 3   homepage_url                       1124380 non-null  string        
 4   status                             1139718 non-null  string        
 5   org_country                        1089957 non-null  string        
 6   org_city                           1089957 non-null  string        
 7   founded_on                         1139718 non-null  string        
 8   first_funding_date                 131581 non-null   datetime64[ns]
 9   date_of_1_million                  70990 non-null    string        
 10  weighted_t

In [21]:
# Quick schema/memory sanity check for the UK subset.
uk_comps.info()

<class 'pandas.core.frame.DataFrame'>
Index: 83062 entries, 440 to 3992639
Data columns (total 98 columns):
 #   Column                             Non-Null Count  Dtype         
---  ------                             --------------  -----         
 0   org_uuid                           83062 non-null  string        
 1   org_name                           83062 non-null  string        
 2   legal_name                         47041 non-null  string        
 3   homepage_url                       82297 non-null  string        
 4   status                             83062 non-null  string        
 5   org_country                        83062 non-null  string        
 6   org_city                           83062 non-null  string        
 7   founded_on                         83062 non-null  string        
 8   first_funding_date                 8401 non-null   datetime64[ns]
 9   date_of_1_million                  4814 non-null   string        
 10  weighted_time                      

In [22]:
# Quick sanity check: count IPO/acquired labels in the exported subset(s).

# Map display names to DataFrames directly instead of resolving names via globals().
dfs_for_status_counts = {
    "uk_comps": uk_comps,
}

status_counts = {}

for name, _df in dfs_for_status_counts.items():
    if "status" not in _df.columns:
        continue

    status_series = _df["status"].astype(str).str.lower()
    counts = status_series.value_counts()
    status_counts[name] = {
        "ipo": int(counts.get("ipo", 0)),
        "acquired": int(counts.get("acquired", 0)),
    }

if status_counts:
    print("IPO and Acquired counts per DataFrame:")
    for name, counts in status_counts.items():
        print(f" - {name}: IPO={counts['ipo']:,} | Acquired={counts['acquired']:,}")
else:
    print("No DataFrames with a status column were found for counting.")


IPO and Acquired counts per DataFrame:
 - uk_comps: IPO=269 | Acquired=3,536


### Filtering US

Create the US subset (`org_country == 'USA'`), export it, and run quick sanity checks.


In [23]:
# US subset (ISO alpha-3: USA).

usa_comps = df_study.loc[df_study["org_country"].eq("USA")].copy()

In [24]:
# Export the US subset to disk.
usa_comps.to_csv(output_path_usa, index=False)

In [25]:
# Quick schema/memory sanity check for the US subset.
usa_comps.info()

<class 'pandas.core.frame.DataFrame'>
Index: 417119 entries, 15 to 3994570
Data columns (total 98 columns):
 #   Column                             Non-Null Count   Dtype         
---  ------                             --------------   -----         
 0   org_uuid                           417119 non-null  string        
 1   org_name                           417117 non-null  string        
 2   legal_name                         242438 non-null  string        
 3   homepage_url                       412698 non-null  string        
 4   status                             417119 non-null  string        
 5   org_country                        417119 non-null  string        
 6   org_city                           417119 non-null  string        
 7   founded_on                         417119 non-null  string        
 8   first_funding_date                 50509 non-null   datetime64[ns]
 9   date_of_1_million                  30904 non-null   string        
 10  weighted_time          

In [26]:
# Quick sanity check: count IPO/acquired labels in the exported subset(s).

# Map display names to DataFrames directly instead of resolving names via globals().
dfs_for_status_counts = {
    "usa_comps": usa_comps,
}

status_counts = {}

for name, _df in dfs_for_status_counts.items():
    if "status" not in _df.columns:
        continue

    status_series = _df["status"].astype(str).str.lower()
    counts = status_series.value_counts()
    status_counts[name] = {
        "ipo": int(counts.get("ipo", 0)),
        "acquired": int(counts.get("acquired", 0)),
    }

if status_counts:
    print("IPO and Acquired counts per DataFrame:")
    for name, counts in status_counts.items():
        print(f" - {name}: IPO={counts['ipo']:,} | Acquired={counts['acquired']:,}")
else:
    print("No DataFrames with a status column were found for counting.")


IPO and Acquired counts per DataFrame:
 - usa_comps: IPO=1,540 | Acquired=15,294
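
The UK and US counting cells repeat the same logic. One way to factor it out is a function that takes the DataFrame itself, sketched here with a hypothetical helper `count_exit_statuses` (not part of the notebook) and toy data:

```python
import pandas as pd


def count_exit_statuses(df, labels=("ipo", "acquired")):
    """Return {label: count} for the given status labels, ignoring case."""
    if "status" not in df.columns:
        return {}
    # Lower-case before counting so "IPO" and "ipo" fall in the same bucket.
    counts = df["status"].astype("string").str.lower().value_counts()
    return {label: int(counts.get(label, 0)) for label in labels}


# Toy data for illustration; the real call would pass uk_comps / usa_comps.
toy = pd.DataFrame({"status": ["IPO", "acquired", "operating", "Acquired", None]})

for name, frame in {"toy": toy}.items():
    counts = count_exit_statuses(frame)
    print(f" - {name}: IPO={counts['ipo']:,} | Acquired={counts['acquired']:,}")
```

Passing the frames explicitly means a misspelled or undefined subset raises an error immediately instead of being silently skipped.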
