# Serial Founder Identification

This notebook inspects Crunchbase bulk export data to flag **serial founders**: people who have at least one *earlier* founding role.

Each founder role gets a **reference date**: its organization’s `founded_on` date or, if that is missing, the role’s `started_on`. A founder role counts as “earlier” when its reference date is strictly before the reference date of the current role.
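
As a minimal sketch of the fallback rule described above, the reference date can be expressed with `fillna`. The toy rows and the `reference_date` column name are illustrative assumptions, not values from the Crunchbase export.

```python
import pandas as pd

# Toy data (hypothetical values, not taken from the Crunchbase export).
roles = pd.DataFrame({
    'org_name': ['Alpha', 'Beta', 'Gamma'],
    'org_founded_on': pd.to_datetime(['2010-01-01', None, None]),
    'started_on': pd.to_datetime(['2011-06-01', '2015-03-01', None]),
})

# Prefer the organization's founding date; fall back to the role's start date.
# Rows where both dates are missing keep a missing (NaT) reference date.
roles['reference_date'] = roles['org_founded_on'].fillna(roles['started_on'])
print(roles[['org_name', 'reference_date']])
```

Alpha keeps its founding date, Beta falls back to its start date, and Gamma has no usable date at all.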

**Inputs**
- `jobs.csv`: person ↔ organization roles (title/job_type, start/end dates)
- `organizations.csv`: organization metadata (founded_on, closed_on, primary_role)

**Outputs created in-memory**
- `founder_roles`: all roles classified as founder, annotated with prior-founding history
- `serial_founders`: subset of `founder_roles` where `had_prior_founder == True`
- `person_summary`: per-person counts of founder roles and known serial roles

**Important caveats**
- Results are only as complete as the Crunchbase export and its date coverage.
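- Missing dates limit our ability to order roles. A role with neither an organization founding date nor a start date is never flagged as serial and never counts as a prior founding, so serial founders are likely under-counted.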
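
One way to gauge how much the date caveat matters is to measure date coverage before classifying. The sketch below uses a hypothetical helper, `date_coverage`, on toy data; the column names follow the ones used later in this notebook.

```python
import pandas as pd


def date_coverage(df: pd.DataFrame, date_cols: list) -> pd.Series:
    """Share of rows with a non-missing value in each date column (hypothetical helper)."""
    return df[date_cols].notna().mean()


# Toy data (illustrative only).
toy = pd.DataFrame({
    'org_founded_on': pd.to_datetime(['2010-01-01', None, '2012-01-01', None]),
    'started_on': pd.to_datetime(['2010-02-01', '2011-01-01', '2012-05-01', None]),
})

date_cols = ['org_founded_on', 'started_on']
coverage = date_coverage(toy, date_cols)

# Share of rows with no usable date at all; these rows cannot be ordered.
no_usable_date = toy[date_cols].isna().all(axis=1).mean()
print(coverage)
print(no_usable_date)
```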


## 1) Setup

Define the local path to the Crunchbase bulk export and configure Pandas display.


In [None]:
# Imports and configuration for the analysis.
# NOTE: `DATA_DIR` is a machine-specific path; update it if you run this notebook elsewhere.
from pathlib import Path
import pandas as pd

pd.set_option('display.max_columns', None)

DATA_DIR = Path('/Users/stefan/Desktop/Thesis/v4/Crunchbase Data/bulk_export').resolve()
assert DATA_DIR.exists(), f'Missing data directory: {DATA_DIR}'
DATA_DIR

PosixPath('/Users/stefan/Desktop/Thesis/v3/Crunchbase Data/bulk_export')

## 2) Load Crunchbase tables

Load `jobs.csv` (roles) and `organizations.csv` (org metadata) and normalize a few fields for joining and date comparisons.


In [4]:
# Load the minimum set of columns needed for founder-role identification.
# We read every column as pandas' nullable `string` dtype and treat only empty strings as missing values.
# Organization columns are renamed with an `org_` prefix to avoid name collisions after merges.
jobs_cols = [
    'uuid',
    'person_uuid',
    'person_name',
    'org_uuid',
    'org_name',
    'title',
    'job_type',
    'started_on',
    'ended_on',
]

jobs = pd.read_csv(
    DATA_DIR / 'jobs.csv',
    usecols=jobs_cols,
    dtype='string',
    keep_default_na=False,
    na_values=[''],
)

org_info = pd.read_csv(
    DATA_DIR / 'organizations.csv',
    usecols=['uuid', 'founded_on', 'closed_on', 'primary_role'],
    dtype='string',
    keep_default_na=False,
    na_values=[''],
).rename(
    columns={
        'uuid': 'org_uuid',
        'founded_on': 'org_founded_on',
        'closed_on': 'org_closed_on',
        'primary_role': 'org_primary_role',
    }
)

for col in ['org_founded_on', 'org_closed_on']:
    org_info[col] = pd.to_datetime(org_info[col], errors='coerce')

jobs.head()

Unnamed: 0,uuid,person_uuid,person_name,org_uuid,org_name,started_on,ended_on,title,job_type
0,697b6934-fc1f-9d63-cfb2-1a10759b378e,ed13cd36-fe2b-3707-197b-0c2d56e37a71,Ben Elowitz,e1393508-30ea-8a36-3f96-dd3226033abd,Wetpaint,2005-10-01,2014-06-01,Co-Founder and CEO,executive
1,b1de3765-442e-b556-9304-551c2a055901,5ceca97b-493c-1446-6249-5aaa33464763,Kevin Flaherty,e1393508-30ea-8a36-3f96-dd3226033abd,Wetpaint,,,VP Marketing,executive
2,1319cd30-f5e8-c700-0af6-64029c6f7124,9f99a98a-aa97-b30b-0d36-db67c1d277e0,Raju Vegesna,bf4d7b0e-b34d-2fd8-d292-6049c4f7efc7,Zoho,2000-11-01,,Chief Evangelist,employee
3,27a252de-1ea8-c620-b2d4-5b889fa9b40f,6e1bca72-a865-b518-b305-31214ce2d1b0,Ian Wenig,bf4d7b0e-b34d-2fd8-d292-6049c4f7efc7,Zoho,2006-03-01,,VP Business Development,executive
4,5a802a79-229f-44ae-0aba-db330f10b67a,c92a1f00-8c19-bf2e-0f28-dbbd383dc968,Jay Adelson,5f2b40b8-d1b3-d323-d81a-b7a8e89553d0,Digg,2005-07-01,2010-04-05,CEO,executive
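
The effect of `keep_default_na=False, na_values=['']` in the loading cell can be illustrated on a tiny in-memory CSV. This is a sketch with made-up rows: the organization literally named `NA` is a hypothetical example of a value that pandas' default missing-value markers would swallow.

```python
import io

import pandas as pd

# Made-up CSV: one organization is literally named "NA", one date is empty.
csv_text = "org_name,founded_on\nNA,2001-01-01\nAcme,\n"

# Default parsing treats strings such as "NA" as missing values.
default = pd.read_csv(io.StringIO(csv_text), dtype='string')

# The notebook's settings treat only the empty string as missing.
strict = pd.read_csv(
    io.StringIO(csv_text),
    dtype='string',
    keep_default_na=False,
    na_values=[''],
)

print(default)
print(strict)
```

With the default settings the name `NA` is read as missing; with the notebook's settings it survives as text, while the empty date is still treated as missing.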


## 3) Identify founder roles

Filter the roles table down to founder-like roles using the `title` and `job_type` fields, then merge in organization founding dates.


In [5]:
# Identify founder-like roles using a simple keyword match on title and job_type.
# This is intentionally broad (e.g., it catches 'co-founder') and may include false positives.
# Date fields are parsed to datetimes so we can compare and sort chronologically.
founder_mask = (
    jobs['title'].str.contains('founder', case=False, na=False)
    | jobs['job_type'].str.contains('founder', case=False, na=False)
)

founder_roles = jobs.loc[founder_mask].copy()

for col in ['started_on', 'ended_on']:
    founder_roles[col] = pd.to_datetime(founder_roles[col], errors='coerce')

founder_roles = founder_roles.merge(org_info, on='org_uuid', how='left')
founder_roles = founder_roles.sort_values(['person_uuid', 'started_on', 'org_uuid']).reset_index(drop=True)
founder_roles.head()

Unnamed: 0,uuid,person_uuid,person_name,org_uuid,org_name,started_on,ended_on,title,job_type,org_founded_on,org_closed_on,org_primary_role
0,8a0b3671-fd90-4366-811f-0ac8074b2517,0000302b-29ea-477b-8745-5da042e56dc1,Yoaldis Pena Rodríguez,6b5499b8-40ca-4af3-ab57-f8114cc58074,ProDev Solution,NaT,NaT,Founder,executive,2009-01-01,NaT,company
1,949e871b-ff5c-4657-8525-8b242235dcc3,00003acb-a5fc-497e-9a6f-a0540950b7bd,Michelle Hertel,011a5cbb-42ce-4a64-877e-4d4d8f2c5979,Penta Machine Company,2011-01-01,NaT,"Co-Founder, Co-Owner and CEO",executive,2011-01-01,NaT,company
2,3a53c444-cb89-4ac0-b4fd-7b60e28edb00,00006245-8ec4-44cf-a8db-2ee996d955c0,James Smart,3b59c5a1-534b-481c-93d8-4fccaba95378,James Smart,1851-01-01,NaT,Founder,executive,1888-08-18,NaT,company
3,a341dbe8-2ae9-4816-b6b8-dbfa20679049,00008708-e454-4c59-beaf-08d1fc08a28d,Daniel Corazzi,8a39cdd7-5e56-4ea5-95ee-cb5cb749cddd,Eyeora,2017-06-01,NaT,Founder and CEO,executive,2017-01-01,NaT,company
4,ed985f05-beeb-4988-93a9-b1742a050c08,00009540-e9a2-4a4d-bd9d-71ea8e028f1d,Vinod Goyal,cc7186bc-2e64-40d9-a5b4-7b959b0664bb,Enterprise Information Services,1994-09-01,NaT,Founder & President,executive,1994-01-01,NaT,company
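
To show what the keyword match does and does not catch, here is a small sketch on invented titles (none are taken from the export). It uses the same `str.contains` call as the cell above.

```python
import pandas as pd

# Invented titles for illustration.
titles = pd.Series(
    [
        'Co-Founder and CEO',             # matched
        'FOUNDER',                        # matched (case-insensitive)
        'Founding Partner',               # not matched: "founding" does not contain "founder"
        'Chief of Staff to the Founder',  # matched, but arguably a false positive
        None,                             # missing title: na=False maps it to False
    ],
    dtype='string',
)

mask = titles.str.contains('founder', case=False, na=False)
print(mask.tolist())
```

The third and fourth titles show both directions of error: a plain substring match can miss founder-like wording and can also pick up roles that merely mention a founder.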


## 4) Annotate each founder role with prior founding history

For each person, we sort their founder roles chronologically and keep track of previously seen founding events. Each row gets: the number of earlier founding roles, a human-readable list of earlier orgs, and a boolean `had_prior_founder`.


In [6]:
# For each person, annotate every founder role with whether they had an earlier founding event.
# Reference date per role: prefer organization founded date; fall back to role start date if missing.
# This produces `prior_known_founder_count`, `prior_founding_orgs`, and `had_prior_founder`.
def annotate_prior_founders(group: pd.DataFrame) -> pd.DataFrame:
    # `group` contains all founder roles for a single person.
    # We scan in sorted order and keep a running list of previously seen founding events.
    # Caveat: rows with a missing `org_founded_on` sort last (NaT) even when their fallback
    # `started_on` is early, so they are not yet in `prior_records` when dated rows are scanned.
    group = group.sort_values(['org_founded_on', 'started_on', 'org_uuid'])
    prior_records = []
    results = []

    for _, row in group.iterrows():
        # Choose a reference date representing the founding event for this row.
        # Prefer organization founded date; fall back to role start date if missing.
        reference_date = row['org_founded_on']
        if pd.isna(reference_date):
            reference_date = row['started_on']

        # Earlier founding events are those in `prior_records` that occurred before this reference date.
        earlier = []
        # Only compare dates when we have a reference date for the current row.
        if pd.notna(reference_date):
            earlier = [
                rec for rec in prior_records
                if pd.notna(rec['reference']) and rec['reference'] < reference_date
            ]

        # Attach summary fields used later for filtering and inspection.
        results.append({
            **row,
            'prior_known_founder_count': len(earlier),
            'prior_founding_orgs': '; '.join(rec['label'] for rec in earlier) if earlier else None,
            'had_prior_founder': len(earlier) > 0,
        })

        # Record this founding event so it can be counted as prior for later roles.
        # Use org name when available; fall back to org UUID for traceability.
        if pd.notna(reference_date):
            org_label = row.get('org_name')
            if isinstance(org_label, str) and org_label.strip():
                label = org_label
            else:
                label = row['org_uuid']
            prior_records.append({
                'reference': reference_date,
                'label': f"{label} ({reference_date.date()})",
            })

    return pd.DataFrame(results)


founder_roles = (
    founder_roles.groupby('person_uuid', group_keys=False)
    .apply(annotate_prior_founders)
    .sort_values(['person_uuid', 'started_on', 'org_uuid'])
    .reset_index(drop=True)
)

founder_roles[['person_name', 'org_name', 'started_on', 'prior_known_founder_count', 'had_prior_founder']].head()


Unnamed: 0,person_name,org_name,started_on,prior_known_founder_count,had_prior_founder
0,Yoaldis Pena Rodríguez,ProDev Solution,NaT,0,False
1,Michelle Hertel,Penta Machine Company,2011-01-01,0,False
2,James Smart,James Smart,1851-01-01,0,False
3,Daniel Corazzi,Eyeora,2017-06-01,0,False
4,Vinod Goyal,Enterprise Information Services,1994-09-01,0,False
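
As a cross-check on the row-by-row annotation, the same definition (number of the person's roles with a strictly earlier reference date) can be sketched in vectorized form with a grouped rank. This is an alternative formulation on toy data, not the method used in the cell above; the `*_check` column names are hypothetical. Because it orders by reference date directly, its counts may differ from the loop for people who have roles without an `org_founded_on`.

```python
import pandas as pd

# Toy data (illustrative only).
toy = pd.DataFrame({
    'person_uuid': ['p1', 'p1', 'p1', 'p2', 'p2'],
    'org_name': ['A', 'B', 'C', 'D', 'E'],
    'org_founded_on': pd.to_datetime(['2005-01-01', None, '2012-01-01', '2010-01-01', None]),
    'started_on': pd.to_datetime(['2005-02-01', '2008-01-01', '2012-03-01', '2010-01-01', None]),
})

# Reference date: organization founding date, else role start date.
ref = toy['org_founded_on'].fillna(toy['started_on'])

# Within each person, rank(method='min') - 1 equals the number of roles
# with a strictly earlier reference date. Missing dates get a NaN rank.
rank = ref.groupby(toy['person_uuid']).rank(method='min')
toy['prior_count_check'] = (rank - 1).fillna(0).astype(int)
toy['had_prior_check'] = toy['prior_count_check'] > 0
print(toy[['person_uuid', 'org_name', 'prior_count_check', 'had_prior_check']])
```

In this toy data, organization C gets a count of 2 because B's fallback start date (2008) precedes C's founding date (2012).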


## 5) Inspect serial founders

Subset to rows where we can establish at least one earlier founding role and preview the key fields.


In [7]:
# Inspect the subset of founder roles that are classified as serial (at least one earlier founding event).
serial_founders = founder_roles.loc[founder_roles['had_prior_founder']].copy()
serial_founders = serial_founders.sort_values(['person_name', 'started_on']).reset_index(drop=True)

columns_to_show = [
    'person_name',
    'person_uuid',
    'org_name',
    'org_uuid',
    'started_on',
    'prior_founding_orgs',
    'org_founded_on',
    'org_primary_role',
]
serial_founders.loc[:, columns_to_show].head(20)

Unnamed: 0,person_name,person_uuid,org_name,org_uuid,started_on,prior_founding_orgs,org_founded_on,org_primary_role
0,A P Jafar Parambil,acd0348c-3d64-430b-8511-3ce7b2db2ecb,GOEC Azad Power,9f5eafbd-6de0-40d6-815b-9b1395d3ac8e,2021-06-01,GOEC (2011-01-01),NaT,company
1,A Vijay Arisetty,c4ce1ac3-85b8-408e-b06c-05bdcb7b956d,Aurm,aa11484e-b699-4075-b6d9-7a353134608a,2024-06-01,MyGate (2016-01-01),2023-04-17,company
2,A. Balasubramanian,f205ab6a-c48a-4a08-86fe-5534cec51947,Sri Balaji Society,ab4b38ff-15a2-4989-b08f-4a48fc36a0be,1999-01-01,Symbiosis Institute Of Business Management (19...,1998-01-01,company
3,A. G. Krishnamurthy,c4279f07-70dd-44b4-b92c-150fd4a03b76,Mudra Institute of Communications Ahmedabad,67dfe1b6-e64e-c6e2-9fea-bc9f7e035a07,NaT,The DDB Mudra Group (1980-01-01),1991-01-01,school
4,A. Hossein Zarrin,4149bf6e-e8a5-4ef2-8706-631b235d7340,PharmaCare group,59d9e68c-abad-46d8-baa8-b2ab9e614224,2018-03-01,Persian Care (2006-01-01),2018-01-01,company
5,A. John Hart,729d6f10-1d36-d064-cec8-c16da5f464f4,VulcanForms,b69cb300-83df-45fb-8a32-992d7aac07da,2015-07-01,Desktop Metal (2015-01-01),2015-07-09,company
6,A. K. Pradeep,ea8d043c-06b4-812c-6882-775001d09a13,MachineVantage,5b8323c1-7a1b-cf22-69ff-fdb85ca702db,2016-01-01,Nasdaq Boardvantage (2000-01-01),2016-01-01,company
7,A. K. Pradeep,ea8d043c-06b4-812c-6882-775001d09a13,StimScience,5e0d888e-c2ea-46fc-a357-fc70b997e242,2017-01-01,Nasdaq Boardvantage (2000-01-01); MachineVanta...,2017-01-01,company
8,A. Kumar,4321ae7d-4cd2-4e7a-a22e-12f7667f6f8f,SharkStriker Inc.,06bffcb6-7dd9-4db1-97e2-43d6e1efc66b,2019-01-05,Cloud24x7 (2016-01-01),2019-01-05,company
9,A. Michael Blanche,80bd69d8-b7ba-42d3-ba48-57cadb61f321,Ethos Treatment,ad67f1f0-12ab-4ba2-9057-d1ddf50a7021,2017-12-01,Equine Therapeutic Alliance (2011-01-01),2018-01-01,company


## 6) Person-level summary

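Aggregate founder roles to the person level. `total_founder_roles` counts distinct organizations, while `known_serial_roles` counts role rows flagged as serial, so the two are not directly comparable when a person has several founder rows at the same organization.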


In [8]:
# Summarize founder activity at the person level (how many founder roles; how many are known serial roles).
person_summary = (
    founder_roles.groupby(['person_uuid', 'person_name']).agg(
        total_founder_roles=('org_uuid', 'nunique'),
        earliest_founder_start=('started_on', 'min'),
        known_serial_roles=('had_prior_founder', 'sum'),
    ).reset_index().sort_values('known_serial_roles', ascending=False)
)

person_summary.head(20)

Unnamed: 0,person_uuid,person_name,total_founder_roles,earliest_founder_start,known_serial_roles
173339,396dc987-fa6b-d5db-c6da-487576633993,Noubar Afeyan,40,1989-01-01,39
56634,12ba95e0-0d35-44ee-89f6-8c263f143778,Uchechukwu Ajuzieogu,25,2014-06-01,24
14690,04d2a71d-a966-8c48-442f-f8a67b7a76bd,Hanson Gifford,23,1998-06-01,23
724422,f0b5beb4-36fb-7b16-6238-38aecaf0526c,William T Gross,22,1991-01-01,21
663952,dca1769a-c360-e2fd-6c65-d9623ae44c7f,Carlos Blanco,22,1999-09-01,21
137407,2d8adb40-1ee6-a58e-fb44-e000be64f94f,Peter Diamandis,18,1987-01-01,18
494836,a4365642-a877-6c70-0f3f-31bd460b6c37,Rehan Allahwala,18,1996-01-01,16
115430,263fe629-1547-4ac8-9e27-adc15f594474,Howard Leonhardt,16,1982-05-01,15
151037,32109616-7db2-d24b-be8e-59796c1b3cde,Jack Abraham,16,2008-01-01,15
421072,8b9efcb4-bd13-426a-bffe-972653ba6b2a,Moses TF Vibbie,16,2014-01-01,15


## 7) Export (optional)

Write the `serial_founders` table to disk for downstream use.


In [9]:
# Optional: save the serial founder dataset for downstream use.
# This writes to the notebook's current working directory and will overwrite an existing file with the same name.
serial_founders.to_csv('serial_founders.csv', index=False)
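
# Note for downstream use: CSV does not store dtypes, so date columns come back as
# plain text unless they are parsed again on load. A minimal round-trip sketch on
# toy data, written to a temporary directory rather than the working directory:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame standing in for `serial_founders` (illustrative values only).
toy = pd.DataFrame({
    'person_name': ['Jane Doe'],
    'started_on': pd.to_datetime(['2020-01-01']),
})

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / 'serial_founders.csv'
    toy.to_csv(out_path, index=False)

    # Without `parse_dates`, `started_on` would be read back as strings.
    reloaded = pd.read_csv(out_path, parse_dates=['started_on'])

print(reloaded.dtypes)
```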