# The Geography of Startup Success: The Effect of Proximity to Elite Universities on Startups' Growth and Investment
### by Jessica Dowuona-Owoo

#### Introduction
It seems intuitive that startups benefit from being located near a university. Location is one of the most important endogenous factors that businesses consider, and this paper seeks to determine the relationship between proximity to top-tier universities and startup investment success.

Many studies focus on one particular relationship between universities and startups (or innovation). Some examine the role that universities specialized in technical fields, such as engineering and applied science, play in startup creation (Bonaccorsi et al., 2013) or in startup operations in general (Fritsch & Aamoucke, 2017). Others show positive spillovers from research universities to innovation through research and development (R&D) activity (Anselin et al., 1997). Some papers report a correlation that runs counter to expectations for the fields of study that might have seemed most attractive (Audretsch et al., 2005). Further studies look at how access to knowledge bases shapes the type of startups created in an area (Baptista et al., 2010).

To research this question, we use a startup investments data set, which contains information on startups from 1980 to 2014, and United States (US) university data, which includes ranking, student population, founding year, and other information about universities. This paper focuses on the US, so we consider only US universities and startups. With this data we explore the relationship between proximity and investment success. The insights gained may help startups make more informed location decisions in the early stages of their life span.

In [3]:
pip install geopy

Note: you may need to restart the kernel to use updated packages.


In [1]:
import matplotlib
import matplotlib.colors as mplc
import matplotlib.patches as patches
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from geopy.distance import geodesic


In [3]:
#Data from the Startups data set
objects = pd.read_csv('/Users/jessyterlisner/Desktop/ECO225Project/Data/objects.csv', low_memory=False)
investments = pd.read_csv('/Users/jessyterlisner/Desktop/ECO225Project/Data/investments.csv')
office = pd.read_csv('/Users/jessyterlisner/Desktop/ECO225Project/Data/offices.csv')
milestones = pd.read_csv('/Users/jessyterlisner/Desktop/ECO225Project/Data/milestones.csv')
funds = pd.read_csv('/Users/jessyterlisner/Desktop/ECO225Project/Data/funds.csv')
ipos = pd.read_csv('/Users/jessyterlisner/Desktop/ECO225Project/Data/ipos.csv')
f_rounds = pd.read_csv('/Users/jessyterlisner/Desktop/ECO225Project/Data/funding_rounds.csv')

# rename the ID columns so the tables share a merge key
objects.rename(columns={'id':'funded_object_id'}, inplace=True)
f_rounds.rename(columns={'object_id':'funded_object_id'}, inplace=True)

# remove unwanted columns
obj = objects.drop(columns=['created_at', 'updated_at', 'tag_list', 'logo_height', 'logo_width',
                            'logo_url', 'created_by', 'twitter_username', 'homepage_url'])
inv = investments.drop(columns=['created_at', 'updated_at'])
f_rounds = f_rounds.drop(columns=['source_description', 'created_at', 'updated_at', 'created_by',
                                  'raised_amount', 'pre_money_valuation', 'pre_money_currency_code',
                                  'post_money_valuation', 'post_money_currency_code'])

In [5]:
source = pd.merge(obj, inv, on= 'funded_object_id')

In [7]:
source_funds = pd.merge(source, f_rounds, on= 'funded_object_id')

In [9]:
source_funds['country_code'].value_counts()

country_code
USA    177677
GBR     11102
CAN      4701
DEU      3836
ISR      3037
        ...  
SMR         1
BHR         1
GIN         1
NPL         1
SLV         1
Name: count, Length: 100, dtype: int64

In [11]:
#Choosing only USA data
source_usa = source_funds[source_funds["country_code"] == "USA"]
source_usa.head()

Unnamed: 0,funded_object_id,entity_type,entity_id,parent_id,name,normalized_name,permalink,category_code,status,founded_at,...,funding_round_type,funding_round_code,raised_amount_usd,raised_currency_code,pre_money_valuation_usd,post_money_valuation_usd,participants,is_first_round,is_last_round,source_url
0,c:1,Company,1,,Wetpaint,wetpaint,/company/wetpaint,web,operating,2005-10-17,...,series-a,a,5250000.0,USD,0.0,0.0,2,0,1,http://seattlepi.nwsource.com/business/246734_...
1,c:1,Company,1,,Wetpaint,wetpaint,/company/wetpaint,web,operating,2005-10-17,...,series-b,b,9500000.0,USD,0.0,0.0,3,0,0,http://pulse2.com/2007/01/09/wiki-builder-webs...
2,c:1,Company,1,,Wetpaint,wetpaint,/company/wetpaint,web,operating,2005-10-17,...,series-c+,c,25000000.0,USD,0.0,0.0,4,1,0,http://www.accel.com/news/news_one_up.php?news...
3,c:1,Company,1,,Wetpaint,wetpaint,/company/wetpaint,web,operating,2005-10-17,...,series-a,a,5250000.0,USD,0.0,0.0,2,0,1,http://seattlepi.nwsource.com/business/246734_...
4,c:1,Company,1,,Wetpaint,wetpaint,/company/wetpaint,web,operating,2005-10-17,...,series-b,b,9500000.0,USD,0.0,0.0,3,0,0,http://pulse2.com/2007/01/09/wiki-builder-webs...


In [13]:
#rename values 
office.rename(columns={'object_id':'funded_object_id'}, inplace=True)
#dropping columns as they become duplicates in the main source file
office = office.drop(columns= ['created_at', 'updated_at', 'country_code', 'state_code', 'region', 'city'], inplace=False)

#merge offices and source usa to gain location data 
main = pd.merge(source_usa, office, on= 'funded_object_id')
main = main.dropna(subset = ['latitude' , 'longitude'])

main['country_code'].value_counts()

country_code
USA    231381
Name: count, dtype: int64
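
The US row count grows from 177,677 before the offices merge to 231,381 after it, which suggests that some startups have more than one office row. The sketch below shows one way to check for this before merging. The toy frames and their values are made up for illustration, and keeping only the first listed office is an assumption rather than the rule used in this notebook.

```python
import pandas as pd

# Toy stand-ins for `source_usa` and `office`; all values are made up.
startups = pd.DataFrame({
    "funded_object_id": ["c:1", "c:2"],
    "name": ["A", "B"],
})
offices = pd.DataFrame({
    "funded_object_id": ["c:1", "c:1", "c:2"],
    "latitude": [47.6, 37.8, 40.7],
    "longitude": [-122.3, -122.4, -74.0],
})

# Count office rows per startup; any count above 1 duplicates that
# startup's rows in the merged table.
offices_per_startup = offices["funded_object_id"].value_counts()
multi_office = offices_per_startup[offices_per_startup > 1]

merged = pd.merge(startups, offices, on="funded_object_id")
print(len(startups), "->", len(merged))

# Hypothetical rule: keep the first listed office per startup so each
# startup contributes one location. `validate` raises a MergeError if the
# keys are not unique on both sides.
one_office = offices.drop_duplicates(subset="funded_object_id", keep="first")
merged_unique = pd.merge(startups, one_office, on="funded_object_id",
                         validate="one_to_one")
```

Whether to keep one office or all of them depends on the question: for proximity, the headquarters (if it can be identified) is probably the relevant location.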

In [9]:
#Data from the additional data set 
uni = pd.read_csv('/Users/jessyterlisner/Desktop/ECO225Project/Data/Universities, the United States.csv')
uni = uni.dropna(subset = ['latitude' , 'longitude'])
uni.head()

Unnamed: 0,university,country,domain,city,ranking,address,foundation year,description,total students,undergraduate students,graduate students,international students,latitude,longitude,logo link
0,"Texas A&M University, College Station",United States,tamu.edu,College Station,163,George Bush Drive,1876.0,Scope. The flagship of the 18-member Texas A&M...,73267.0,56527.0,16740.0,5861.0,30.627777,-96.33417,https://www.shanghairanking.cn/_uni/logo/df034...
1,Ohio State University,United States,osu.edu,Columbus,68,"Student Academic Services Building, 281 W. Lan...",1870.0,"Since 1870, The Ohio State University has been...",61492.0,50293.0,11199.0,7173.0,39.961113,-82.998886,https://www.shanghairanking.cn/_uni/logo/1ccdf...
2,University of Texas at Arlington,United States,uta.edu,Arlington,707,,,,60035.0,42763.0,17272.0,9005.0,32.735554,-97.10778,https://www.shanghairanking.cn/_uni/logo/7c212...
3,New York University,United States,nyu.edu,New York,27,"New York University, 70 Washington Square Sout...",1831.0,,58091.0,29902.0,28189.0,19170.0,40.71417,-74.006386,https://www.shanghairanking.cn/_uni/logo/d254e...
4,University of Central Florida,United States,ucf.edu,Florida,479,4000 Central Florida Blvd.,1963.0,The University of Central Florida (UCF) is a p...,56740.0,51717.0,5023.0,2476.0,18.365278,-66.5675,https://www.shanghairanking.cn/_uni/logo/813d8...


### Variable decisions

The chosen Y variable is investment success, expressed through measures such as the average number of funding rounds, the value of investment raised, and the number of milestones achieved. With our current data set, these measures are convenient and easy to calculate. They are also relatively easy to interpret: for example, a higher average number of funding rounds can indicate more investor-investee interaction.

For the X variables, we will consider the number of startups per state, the proximity to elite universities, the entity type (company, person…), the category of the startup, the city (industry hot-spot), and the number of high-ranking universities in the city. The number of startups per state measures how densely startups are concentrated, and studying it may help us discover whether that concentration helps or hurts the average funds each startup receives.
The city (industry hot-spot) variable helps us account for skewed data. For example, California's status as a technology hot spot suggests that companies in similar or related fields located there may benefit from better-skilled workers and possibly better funding. This is closely related to the next X variable, the category of the startup: there may be disparities between categories, and these may influence investment success. The number of high-ranking universities in the city is necessary because it tells us whether the research question, which concerns top-tier universities, can be answered directly for a given US city.
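
Because the merged table can hold several rows per startup (one per investment, funding round, and office combination), the outcome measures described above are easier to interpret on a table with one row per startup. The following is a minimal sketch of that step, assuming `funding_rounds`, `funding_total_usd`, and `milestones` are startup-level values repeated on every row. The toy frame and its values are made up for illustration.

```python
import pandas as pd

# Toy rows shaped like `main`; all values are made up.
toy = pd.DataFrame({
    "funded_object_id": ["c:1", "c:1", "c:1", "c:2"],
    "state_code": ["WA", "WA", "WA", "CA"],
    "funding_rounds": [3, 3, 3, 1],
    "funding_total_usd": [39750000.0, 39750000.0, 39750000.0, 500000.0],
    "milestones": [5, 5, 5, 0],
})

outcome_cols = ["funding_rounds", "funding_total_usd", "milestones"]

# Keep one row per startup, then keep only the grouping and outcome columns.
per_startup = (
    toy.drop_duplicates(subset="funded_object_id")
       .set_index("funded_object_id")[["state_code"] + outcome_cols]
)

# State-level averages now weight every startup once.
by_state = per_startup.groupby("state_code").agg(
    Avg_Funding_Rounds=("funding_rounds", "mean"),
    Num_Startups=("funding_rounds", "size"),
)
print(by_state)
```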

### Summary Statistics

In [18]:
#Summary Stats
summary = main.describe().T
summary = summary.loc[['investment_rounds', 'invested_companies', 'funding_rounds', 'funding_total_usd', 'milestones']]
summary.head()

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
investment_rounds,231381.0,0.2036943,6.010564,0.0,0.0,0.0,0.0,478.0
invested_companies,231381.0,0.1878503,5.636464,0.0,0.0,0.0,0.0,459.0
funding_rounds,231381.0,4.599993,2.476528,1.0,3.0,4.0,6.0,14.0
funding_total_usd,231381.0,86460270.0,214634800.0,0.0,12000000.0,37800000.0,82000000.0,5700000000.0
milestones,231381.0,2.139307,1.402732,0.0,1.0,2.0,3.0,9.0


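The table above presents basic summary statistics from the main data set, such as the mean number of funding rounds (about 4.6) and the mean number of milestones (about 2.1). These statistics are computed over the rows of the merged table, so a startup that appears in several rows is counted more than once.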

### Subsetted Summary Statistics

In [20]:
#Average funding rounds per state

summary_table_1 = main.groupby("state_code").agg(
    Total_Funding_Rounds=('funding_rounds', 'sum'),
    Avg_Funding_Rounds=('funding_rounds', 'mean'),
    Num_Companies=('funded_object_id', 'count') )

summary_table_1 = summary_table_1.reset_index()
summary_table_1.head()

Unnamed: 0,state_code,Total_Funding_Rounds,Avg_Funding_Rounds,Num_Companies
0,AL,28,1.272727,22
1,AR,139,2.355932,59
2,AZ,2939,3.335982,881
3,CA,591569,4.747326,124611
4,CO,18459,4.230804,4363


This table gives the average number of funding rounds by state. These values serve as a preliminary introduction to the role of location in startup success. In the rows shown, states with more records have a higher average number of funding rounds and a larger total number of funding rounds. Note that `Num_Companies` counts rows of the merged table rather than distinct startups.
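
One way to put a number on this pattern is a rank correlation between the two columns. The sketch below uses only the five rows displayed above (AL to CO), so it is illustrative rather than a result for all states. It computes a Spearman correlation as the Pearson correlation of the ranks.

```python
import pandas as pd

# The five rows displayed in the table above; the full table has more states.
shown = pd.DataFrame({
    "state_code": ["AL", "AR", "AZ", "CA", "CO"],
    "Avg_Funding_Rounds": [1.272727, 2.355932, 3.335982, 4.747326, 4.230804],
    "Num_Companies": [22, 59, 881, 124611, 4363],
})

# Spearman rank correlation: rank both columns, then take the Pearson
# correlation of the ranks.
rho = shown["Num_Companies"].rank().corr(shown["Avg_Funding_Rounds"].rank())
print(round(rho, 3))
```

For these five rows the two rankings agree exactly. Running the same calculation on the full `summary_table_1` would show whether the ordering holds across all states.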

In [18]:
summary_table_2 = main.groupby("city").agg(
    Total_Amt_Raised=('raised_amount_usd', 'sum'),
    Avg_Amt_Raised=('raised_amount_usd', 'mean'),
    Num_Companies=('funded_object_id', 'count') )

summary_table_2 = summary_table_2.reset_index()
summary_table_2.head()

Unnamed: 0,city,Total_Amt_Raised,Avg_Amt_Raised,Num_Companies
0,"(Oct. 01, 2011 - Sep. 30, 2012)",0.0,0.0,16
1,ALLSTON,300000.0,150000.0,2
2,ATLANTA,0.0,0.0,2
3,AUSTIN,21619990.0,3603332.0,6
4,Acton,4807360000.0,10612270.0,453


In [19]:
uni.info()

<class 'pandas.core.frame.DataFrame'>
Index: 321 entries, 0 to 332
Data columns (total 15 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   university              321 non-null    object 
 1   country                 321 non-null    object 
 2   domain                  320 non-null    object 
 3   city                    321 non-null    object 
 4   ranking                 321 non-null    int64  
 5   address                 175 non-null    object 
 6   foundation year         175 non-null    float64
 7   description             112 non-null    object 
 8   total students          178 non-null    float64
 9   undergraduate students  178 non-null    float64
 10  graduate students       177 non-null    float64
 11  international students  141 non-null    float64
 12  latitude                321 non-null    float64
 13  longitude               321 non-null    float64
 14  logo link               182 non-null    object 

In [None]:
# Function to calculate distance between two points using latitude and longitude
def calculate_distance(lat1, lon1, lat2, lon2):
    return geodesic((lat1, lon1), (lat2, lon2)).kilometers

# Calculate the distance between each startup and the nearest university in the same city
distances = []
for i, startup in main.iterrows():
    min_distance = float('inf')
    for j, university in uni.iterrows():
        # Only compare pairs whose city names match exactly
        if startup['city'] == university['city']:
            distance = calculate_distance(startup['latitude'], startup['longitude'], university['latitude'], university['longitude'])
            if distance < min_distance:
                min_distance = distance
    distances.append(min_distance if min_distance != float('inf') else None)

# Add the distances to the startup data
main['distance_to_nearest_university'] = distances

# Drop rows where distance could not be calculated (i.e., no university in the same city)
main = main.dropna(subset=['distance_to_nearest_university'])

# Save the updated startup data to a new CSV file
#startup_data.to_csv('updated_startup_data.csv', index=False)

#print("Distances calculated and saved to updated_startup_data.csv")
main.head()
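
The loop above compares each startup only with universities whose city name matches exactly, and it calls `geodesic` once per pair, which is slow on a table of this size. As an alternative sketch, the distance to the nearest university in any city can be computed with a vectorized haversine formula. This swaps in a spherical approximation instead of the ellipsoidal `geodesic` used above, so results can differ slightly. The function names and toy coordinates are hypothetical.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; spherical approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km; inputs in degrees, NumPy-broadcastable."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


def nearest_distance_km(s_lat, s_lon, u_lat, u_lon):
    """For each startup, the distance to the closest university in any city."""
    s_lat = np.asarray(s_lat, dtype=float)[:, None]  # shape (n_startups, 1)
    s_lon = np.asarray(s_lon, dtype=float)[:, None]
    u_lat = np.asarray(u_lat, dtype=float)[None, :]  # shape (1, n_universities)
    u_lon = np.asarray(u_lon, dtype=float)[None, :]
    # Broadcasting builds the full startup-by-university distance matrix.
    return haversine_km(s_lat, s_lon, u_lat, u_lon).min(axis=1)


# Toy coordinates (made up) to show the call shape.
d = nearest_distance_km([0.0, 10.0], [0.0, 0.0], [0.0, 10.0], [1.0, 0.0])
print(d)
```

With the notebook's data the call might look like `nearest_distance_km(main['latitude'], main['longitude'], uni['latitude'], uni['longitude'])`. The full distance matrix can be large, so computing it on unique coordinate pairs or in chunks would keep memory use manageable.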

In [None]:
# 1. Scatter Plot – Average Funding Rounds by State 
filtered_data = summary_table_1[summary_table_1["Num_Companies"] < 100000]  # Threshold removes the one outlier state
plt.figure(figsize=(10, 6))
plt.scatter(filtered_data["Num_Companies"], filtered_data["Avg_Funding_Rounds"], alpha=0.7)

plt.xscale('log')  # Log scale for the X-axis
plt.yscale('linear')  # Keep Y-axis linear

# Remove the right and top spines
ax = plt.gca()  # Get current axes
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)

# Labels and title
plt.xlabel("Number of Companies (Log Scale)")
plt.ylabel("Average Funding Rounds")
plt.title("Average Funding Rounds vs. Number of Companies per State")

plt.show()


The scatter plot above shows the relationship between the number of startups per state and the average number of funding rounds per state. The trend suggests that states with more startups see higher, or at least more consistent, funding rounds on average. The one outlier identified (the state with more than 100,000 records) has been filtered out, and the x-axis uses a log scale. This relationship may result from the concentration of investors in major states.

In [None]:
#Average funding rounds per proximity (in ranges)
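
This cell is empty in the notebook. The following is a minimal sketch of how average funding rounds by distance range could be computed, assuming the `distance_to_nearest_university` column (in kilometres) from the distance cell above. The bin edges, labels, and toy rows are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Toy rows; distances are in km. All values are made up.
toy = pd.DataFrame({
    "distance_to_nearest_university": [0.5, 3.0, 7.5, 12.0, 40.0, 2.0],
    "funding_rounds": [5, 4, 3, 2, 1, 6],
})

# Hypothetical bin edges for the distance ranges.
edges = [0, 5, 10, 25, float("inf")]
labels = ["0-5 km", "5-10 km", "10-25 km", "25+ km"]
toy["distance_range"] = pd.cut(
    toy["distance_to_nearest_university"],
    bins=edges, labels=labels, include_lowest=True,
)

# Mean funding rounds within each distance range.
avg_by_range = toy.groupby("distance_range", observed=False)["funding_rounds"].mean()
print(avg_by_range)
```

Fixed-width bins are easy to read, but quantile bins (for example via `pd.qcut`) may be a better fit if most startups fall within a few kilometres of a university.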


In [None]:
entity_type_stats = main["entity_type"].value_counts(normalize=True) * 100  # Percent distribution
entity_type_stats.head()

In [None]:
# Group by entity type, using the outcome columns as they are named in `main`
entity_stats = main.groupby("entity_type")[["funding_rounds", "funding_total_usd", "milestones"]].mean()

# Plot bar chart
plt.figure(figsize=(10, 6))
sns.barplot(x=entity_stats.index, y=entity_stats["funding_total_usd"], palette="Greens_d")
plt.title("Investment Raised by Entity Type")
plt.xlabel("Entity Type")
plt.ylabel("Average Investment Raised ($)")
plt.show()

In [None]:
# City (Industry Hot-Spot)
city_stats = main["city"].value_counts().head(10)  # Row counts for the 10 most common cities
hotspot_share = (city_stats.sum() / len(main)) * 100  # % of rows in those cities
hotspot_share  # a scalar, so it has no .head()

In [None]:
uni_stats = uni["ranking"].describe()
uni_stats  # show all eight summary statistics rather than the first five

### Conclusion
To conclude, the literature review and the data presented suggest that location decisions are important for startups. The scatter plot showed that states with a high density of startups tend to have a higher average number of funding rounds than states with a low density. Further analysis is needed to examine fully the role of proximity to elite universities.

In [29]:
main.head()
main.info()

<class 'pandas.core.frame.DataFrame'>
Index: 132305 entries, 0 to 231380
Data columns (total 56 columns):
 #   Column                          Non-Null Count   Dtype  
---  ------                          --------------   -----  
 0   funded_object_id                132305 non-null  object 
 1   entity_type                     132305 non-null  object 
 2   entity_id                       132305 non-null  int64  
 3   parent_id                       0 non-null       object 
 4   name                            132305 non-null  object 
 5   normalized_name                 132305 non-null  object 
 6   permalink                       132305 non-null  object 
 7   category_code                   131736 non-null  object 
 8   status                          132305 non-null  object 
 9   founded_at                      125207 non-null  object 
 10  closed_at                       3069 non-null    object 
 11  domain                          131393 non-null  object 
 12  short_description    

In [33]:
main['city'].value_counts().head()

city
San Francisco    38240
New York         22160
Cambridge         5943
San Jose          5373
Boston            5069
Name: count, dtype: int64

### References
Anselin, L., Varga, A., & Acs, Z. (1997). Local Geographic Spillovers between University Research and High Technology Innovations. Journal of Urban Economics, 42(3), 422–448.

Audretsch, D. B., Lehmann, E. E., & Warning, S. (2005). University spillovers and new firm location. Research Policy, 34(7), 1113–1122.

Baptista, R., & Mendonça, J. (2010). Proximity to knowledge sources and the location of knowledge-based start-ups. The Annals of Regional Science, 45, 5–29.

Bonaccorsi, A., Colombo, M. G., Guerini, M., et al. (2013). University specialization and new firm creation across industries. Small Business Economics, 41, 837–863.

Fritsch, M., & Aamoucke, R. (2017). Fields of knowledge in higher education institutions, and innovative start‐ups: An empirical investigation. Papers in Regional Science, 96, S1-S28. 
