# Founder Data Scraping

## Project Description

**What is this about?**

- So far, I only have fundamental information about the companies and their seed rounds. Venture capital investors evaluate more than that, however, and many early-stage investors would argue that the founding team is the most important reason for their investment. Since we already have interesting venture capital categories combined with a success measure, it would be interesting to know whether the different investor types show particular investment tendencies with respect to founding teams.

**What tools are used?**

- Two tools are used to obtain the founder data. The first, the Serper API, is a Google Search API: with a suitably constructed query, the link to a founder's LinkedIn profile can be extracted from the search results. The second is Phantombuster, a multi-purpose scraping engine, which I will use to scrape the profile data from the LinkedIn profile URLs. With around 19k companies and around 33k founders, this will take a while, because the scraping has to be paced to respect LinkedIn's usage rules.
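
To make the first step concrete, here is a minimal sketch of how a profile link could be pulled out of a Serper search response. It assumes the response JSON carries an `organic` list whose entries have a `link` field, which is the shape the notebook code further down relies on. The helper name `extract_linkedin_url` and the sample payload are hypothetical.

```python
from typing import Optional


def extract_linkedin_url(response_json: dict) -> Optional[str]:
    """Return the top organic link if it is a LinkedIn profile, else None."""
    organic = response_json.get("organic", [])
    if not organic:
        return None
    first_link = organic[0].get("link", "")
    return first_link if "linkedin.com/in" in first_link else None


# Hypothetical payload, shaped like the responses this notebook parses.
sample = {"organic": [{"link": "https://www.linkedin.com/in/example-founder"}]}
print(extract_linkedin_url(sample))
print(extract_linkedin_url({"organic": []}))
```

Only the first organic result is inspected, so a profile that ranks second is treated as not found rather than guessed at.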

**How is it done?**

- First, all founders are stored in a data frame together with their company name, which yields a data frame of roughly 33k rows. Then a new column is created that holds the search query: a combination of the founder's name, the startup's name, and an additional element indicating that we are looking for a LinkedIn profile. As the API includes only 2.5k free tokens, another 47.5k tokens will be purchased. After the queries have run, the data frame contains the LinkedIn profile URLs. To guard against misleading results, I will add a quality check step and filter out all rows with a missing URL. The data frame is then exported and imported into the Phantombuster platform, where the profiles are scraped and stored in batches of 1.5k per day. Once scraping is complete, the data can be downloaded and used for further cleaning and engineering.
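
The quality check and batching described above can be sketched as follows. This is an illustration under assumptions, not the notebook's actual export step: the helper name `split_into_batches` and the toy data are made up, and the column name `LinkedIn URL` is borrowed from the fetch function further down. The batch size of 1,500 comes from the description.

```python
import math

import pandas as pd

BATCH_SIZE = 1500  # daily Phantombuster batch size named in the description


def split_into_batches(df: pd.DataFrame, batch_size: int = BATCH_SIZE) -> list:
    """Drop rows without a LinkedIn URL, then cut the rest into batches."""
    valid = df.dropna(subset=["LinkedIn URL"]).reset_index(drop=True)
    return [valid.iloc[i:i + batch_size] for i in range(0, len(valid), batch_size)]


# Toy data: 4,000 rows, every fourth one without a URL.
toy = pd.DataFrame({
    "Founder ID": range(4000),
    "LinkedIn URL": [None if i % 4 == 0 else f"https://www.linkedin.com/in/person-{i}"
                     for i in range(4000)],
})
batches = split_into_batches(toy)
print(len(batches), [len(b) for b in batches])  # 3000 valid rows -> 2 batches of 1500

# Upper bound on scraping days if all ~33k founders had a profile URL.
print(math.ceil(33_000 / BATCH_SIZE))  # 22
```

At 1.5k profiles per day, the full set of roughly 33k founders would therefore need at most about 22 days of scraping, fewer once rows with missing URLs are removed.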

## Data Import and Preparation

In [1]:
import pandas as pd
import os
from dotenv import load_dotenv
import requests
import json

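os.chdir('/Users/janlinzner/Projects/Master-Thesis-Spatial-Proximity-Venture-Capital')
# Load variables from a local .env file so that os.getenv('serper') can find the API key.
load_dotenv()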

In [2]:
companies = pd.read_csv('data/sets-for-r/companies_seed.csv')

In [10]:
# Keep only companies that list at least one founder.
filtered_companies = companies[companies['Founders'].notna()]
# Split the comma-separated founder string and create one row per founder.
founders_df = filtered_companies.assign(Founders=filtered_companies['Founders'].str.split(', ')).explode('Founders')
# Number founders consecutively with zero-padded IDs ('00001', '00002', ...).
founders_df['Founder ID'] = (founders_df.reset_index().index + 1).astype(str).str.zfill(5)
# rename() returns a new frame, which avoids pandas' SettingWithCopyWarning on the column subset.
result_df = founders_df[['Company ID', 'Organization Name', 'Organization Name URL', 'Founder ID', 'Founders', 'Success', 'Headquarters Country']].rename(columns={'Founders': 'Founder Name'})
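
For readers unfamiliar with `explode`, the reshaping in this cell can be reproduced on a toy frame. The two companies and the founder names below are made up; only the split, explode, and zero-padded ID steps mirror the cell above.

```python
import pandas as pd

toy = pd.DataFrame({
    "Organization Name": ["Alpha", "Beta"],
    "Founders": ["Ann Lee, Bo Chen", "Cy Diaz"],
})

# Split the comma-separated string into a list, then emit one row per list element.
exploded = toy.assign(Founders=toy["Founders"].str.split(", ")).explode("Founders")

# explode() repeats the original index, so number the rows after resetting it.
exploded = exploded.reset_index(drop=True)
exploded["Founder ID"] = (exploded.index + 1).astype(str).str.zfill(5)

print(exploded)
```

The repeated index is also why the output tables in this notebook show the same row label for co-founders of one company.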


In [11]:
# Work on an explicit copy so the column assignment does not trigger SettingWithCopyWarning.
result_df = result_df.copy()
# Query format: LinkedIn profile filter, company name, founder name.
# The extra ' ' yields the double space after the filter that appears in the request log.
result_df['Search Query'] = 'site:linkedin.com/in ' + ' ' + result_df['Organization Name'] + ' ' + result_df['Founder Name']
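
The same query can also be assembled by a small helper, which makes the format easier to test in isolation. The function name `build_search_query` is hypothetical, and unlike the cell above it joins the parts with single spaces, so its output differs from the logged queries by one space.

```python
def build_search_query(company: str, founder: str) -> str:
    """Restrict the search to LinkedIn profile pages, then add company and founder."""
    parts = ["site:linkedin.com/in", company.strip(), founder.strip()]
    # Skip empty parts so a missing name does not leave stray spaces.
    return " ".join(part for part in parts if part)


print(build_search_query("Kodex AI", "Claus Lang"))
# site:linkedin.com/in Kodex AI Claus Lang
```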


In [5]:
success_df = result_df[result_df['Success']]
no_success_df = result_df[~result_df['Success']]

In [13]:
germany_df = result_df[result_df['Headquarters Country'] == 'Germany']

In [14]:
germany_df

|  | Company ID | Organization Name | Organization Name URL | Founder ID | Founder Name | Success | Headquarters Country | Search Query |
|---|---|---|---|---|---|---|---|---|
| 5 | 9243 | HAPILA GmbH | https://www.crunchbase.com/organization/hapila... | 00005 | Merle Arnika Fuchs | False | Germany | site:linkedin.com/in HAPILA GmbH Merle Arnika... |
| 11 | 9017 | InterNations GmbH | https://www.crunchbase.com/organization/intern... | 00015 | Malte Zeeck | True | Germany | site:linkedin.com/in InterNations GmbH Malte ... |
| 11 | 9017 | InterNations GmbH | https://www.crunchbase.com/organization/intern... | 00016 | Philipp von Plato | True | Germany | site:linkedin.com/in InterNations GmbH Philip... |
| 24 | 9480 | Pharetis | https://www.crunchbase.com/organization/pharetis | 00026 | Dirk Ehrlich | False | Germany | site:linkedin.com/in Pharetis Dirk Ehrlich |
| 24 | 9480 | Pharetis | https://www.crunchbase.com/organization/pharetis | 00027 | Peter Biermann | False | Germany | site:linkedin.com/in Pharetis Peter Biermann |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 16983 | 9665 | heycare (former heynanny) | https://www.crunchbase.com/organization/heynan... | 30183 | Julia Kahle | False | Germany | site:linkedin.com/in heycare (former heynanny... |
| 16993 | 9713 | Kodex AI | https://www.crunchbase.com/organization/kodex-ai | 30197 | Claus Lang | False | Germany | site:linkedin.com/in Kodex AI Claus Lang |
| 16993 | 9713 | Kodex AI | https://www.crunchbase.com/organization/kodex-ai | 30198 | Thomas Kaiser | False | Germany | site:linkedin.com/in Kodex AI Thomas Kaiser |
| 17001 | 9699 | Empion | https://www.crunchbase.com/organization/empion | 30207 | Annika von Mutius | False | Germany | site:linkedin.com/in Empion Annika von Mutius |


In [12]:
result_df

|  | Company ID | Organization Name | Organization Name URL | Founder ID | Founder Name | Success | Headquarters Country | Search Query |
|---|---|---|---|---|---|---|---|---|
| 0 | 948 | Safetica Technologies | https://www.crunchbase.com/organization/safeti... | 00001 | Jakub Mahdal | False | Czech Republic | site:linkedin.com/in Safetica Technologies Ja... |
| 3 | 16735 | Quick TV | https://www.crunchbase.com/organization/quick-tv | 00002 | Nick Bell | False | United Kingdom | site:linkedin.com/in Quick TV Nick Bell |
| 3 | 16735 | Quick TV | https://www.crunchbase.com/organization/quick-tv | 00003 | Tod Yeadon | False | United Kingdom | site:linkedin.com/in Quick TV Tod Yeadon |
| 4 | 11268 | Imperative Energy | https://www.crunchbase.com/organization/impera... | 00004 | Joe O’Carroll | False | Ireland | site:linkedin.com/in Imperative Energy Joe O’... |
| 5 | 9243 | HAPILA GmbH | https://www.crunchbase.com/organization/hapila... | 00005 | Merle Arnika Fuchs | False | Germany | site:linkedin.com/in HAPILA GmbH Merle Arnika... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 16996 | 18538 | PocketEye | https://www.crunchbase.com/organization/pocketeye | 30204 | Meera Radia | False | United Kingdom | site:linkedin.com/in PocketEye Meera Radia |
| 17000 | 18550 | Combat IQ | https://www.crunchbase.com/organization/combat-iq | 30205 | Christian Giang | False | United Kingdom | site:linkedin.com/in Combat IQ Christian Giang |
| 17000 | 18550 | Combat IQ | https://www.crunchbase.com/organization/combat-iq | 30206 | Timur Malik | False | United Kingdom | site:linkedin.com/in Combat IQ Timur Malik |
| 17001 | 9699 | Empion | https://www.crunchbase.com/organization/empion | 30207 | Annika von Mutius | False | Germany | site:linkedin.com/in Empion Annika von Mutius |


## Serper API

In [26]:
def fetch_linkedin_urls(
    source_df: pd.DataFrame,
    num_to_process: int,
    results_file: str = "founder_linkedin_urls.csv",
    api_key_env_var: str = "serper",
    df_key_columns: tuple = ("Company ID", "Founder ID"),
    query_column: str = "Search Query"
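) -> pd.DataFrame:
    """Look up LinkedIn profile URLs for up to `num_to_process` uncached rows.

    Rows whose (Company ID, Founder ID) keys already appear in `results_file`
    are skipped, so repeated calls resume where the previous run stopped.
    Newly processed rows are appended to `results_file` and returned. Rows
    whose lookup failed are cached with an empty URL and are not retried.
    """
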
    source = source_df.copy()
    for key in df_key_columns:
        source[key] = source[key].astype(int)  # e.g. "00001" → 1

    if os.path.exists(results_file):
        existing_results = pd.read_csv(
            results_file,
            dtype={df_key_columns[0]: int, df_key_columns[1]: int}
        )
    else:
        
        cache_columns = list(source.columns) + ["LinkedIn URL"]
        existing_results = pd.DataFrame(columns=cache_columns)
        for key in df_key_columns:
            existing_results[key] = existing_results[key].astype(int)

    merged = pd.merge(
        source,
        existing_results[list(df_key_columns)],  
        how="left",
        on=list(df_key_columns),
        indicator=True
    )
    
    new_rows_mask = merged["_merge"] == "left_only"
    unprocessed = merged.loc[new_rows_mask, source.columns] 

    to_process = unprocessed.head(num_to_process).copy()
    if to_process.empty:
        print("No new rows to process; all keys are already cached.")
        
        return pd.DataFrame(columns=list(source.columns) + ["LinkedIn URL"])

    to_process.reset_index(drop=True, inplace=True)
    to_process["LinkedIn URL"] = None

    SERPER_API_URL = "https://google.serper.dev/search"
    api_key = os.getenv(api_key_env_var)
    if not api_key:
        raise RuntimeError(f"Environment variable '{api_key_env_var}' not set.")
    headers = {"X-API-KEY": api_key}

    for idx, row in to_process.iterrows():
        company_id = int(row[df_key_columns[0]])
        founder_id = int(row[df_key_columns[1]])
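        query_text = row[query_column]

        # Defensive lookup: to_process was already filtered against the cache,
        # so this branch should not normally be reached.
        cached = existing_results[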
            (existing_results[df_key_columns[0]] == company_id) &
            (existing_results[df_key_columns[1]] == founder_id)
        ]
        if not cached.empty:
            url = cached.iloc[0]["LinkedIn URL"]
            print(f"[Row {idx}] (cached) Query: '{query_text}' → URL: {url!r}")
            to_process.at[idx, "LinkedIn URL"] = url
            continue

        print(f"[Row {idx}] (requesting) Query: '{query_text}'")
        payload = {"q": query_text}

        try:
            # A timeout keeps one stalled request from hanging the whole batch.
            resp = requests.post(SERPER_API_URL, json=payload, headers=headers, timeout=30)
        except requests.RequestException as e:
            print(f"  → Exception on row {idx}: {e}")
            continue

        if resp.status_code == 200:
            data = resp.json().get("organic", [])
            if data:
                first_link = data[0].get("link", "")
                if "linkedin.com/in" in first_link:
                    to_process.at[idx, "LinkedIn URL"] = first_link
                    print(f"  → Found: {first_link!r}")
                else:
                    to_process.at[idx, "LinkedIn URL"] = None
                    print(f"  → Top result not LinkedIn (got {first_link!r}); leaving as None")
            else:
                to_process.at[idx, "LinkedIn URL"] = None
                print("  → No organic results; leaving as None")
        else:
            to_process.at[idx, "LinkedIn URL"] = None
            print(f"  → HTTP {resp.status_code}: {resp.text[:200]!r}… (leaving as None)")


    # Append this run's rows to the cache; on duplicate keys the cached entry wins.
    combined = pd.concat([existing_results, to_process], ignore_index=True)
    combined.drop_duplicates(subset=list(df_key_columns), keep="first", inplace=True)
    combined.to_csv(results_file, index=False)

    return to_process
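
The resume logic in `fetch_linkedin_urls` rests on a left merge with `indicator=True`, which marks rows whose keys are missing from the cache as `left_only`. A toy version, with made-up IDs and queries, shows the pattern in isolation:

```python
import pandas as pd

keys = ["Company ID", "Founder ID"]

source = pd.DataFrame({
    "Company ID": [1, 1, 2],
    "Founder ID": [1, 2, 3],
    "Search Query": ["q1", "q2", "q3"],
})
# Pretend founder 1 of company 1 was already looked up in an earlier run.
cache = pd.DataFrame({"Company ID": [1], "Founder ID": [1], "LinkedIn URL": [None]})

# Rows found only in `source` get _merge == "left_only": these are still to do.
merged = source.merge(cache[keys], how="left", on=keys, indicator=True)
unprocessed = merged.loc[merged["_merge"] == "left_only", source.columns]

print(unprocessed)
```

Because the cache is keyed on both IDs, a row is skipped even when its cached URL is empty, so an earlier lookup without a result does not spend another token.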

In [32]:
df_new = fetch_linkedin_urls(
     source_df=success_df,
     num_to_process=2000,
     results_file="notebook/founder_data/test_results.csv",
     api_key_env_var="serper",
     df_key_columns=("Company ID", "Founder ID"),
     query_column="Search Query"
)

[Row 0] (requesting) Query: 'site:linkedin.com/in  Sales Layer Iban Borras'
  → Found: 'https://es.linkedin.com/in/ibanborras'
[Row 1] (requesting) Query: 'site:linkedin.com/in  Qida Anna Montanes'
  → No organic results; leaving as None
[Row 2] (requesting) Query: 'site:linkedin.com/in  Qida Guillem Garcia Galofre'
  → No organic results; leaving as None
[Row 3] (requesting) Query: 'site:linkedin.com/in  Qida Lluís Guitart Moya'
  → Found: 'https://ad.linkedin.com/in/llu%C3%ADs-guitart-moya-986313142'
[Row 4] (requesting) Query: 'site:linkedin.com/in  Innovamat Andreu Dotti'
  → Found: 'https://es.linkedin.com/in/andreu-dotti-boada/en'
[Row 5] (requesting) Query: 'site:linkedin.com/in  Innovamat Isaac Sayol'
  → Found: 'https://es.linkedin.com/in/isaac-sayol-piedra-8675b2116'
[Row 6] (requesting) Query: 'site:linkedin.com/in  Innovamat Àlex Pérez-Muelas'
  → Found: 'https://es.linkedin.com/in/%C3%A0lex-espinet-p%C3%A9rez-muelas-31b97213b'
[Row 7] (requesting) Query: 'site:linkedin.com