The following is an exploratory data analysis of real-world data on start-up companies that have received investment, either through private funding alone or also through public offerings. We will explore the sectors these companies work in and analyze which sectors are best and worst for a start-up.

In [4]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

import sys

np.set_printoptions(threshold=sys.maxsize)
pd.set_option('display.max_columns',999)
pd.set_option('display.max_rows',999)

In [6]:
df = pd.read_csv("investments2.csv", encoding = "ISO-8859-1", engine='python')
df.head()
# we have information on name, url, category, market, acquired/operating/closed status, and geographical location.

Unnamed: 0,name,homepage_url,category_list,market,funding_total_usd,status,country_code,state_code,region,city,funding_rounds,founded_at,founded_month,founded_quarter,founded_year,first_funding_at,last_funding_at,seed,venture,equity_crowdfunding,undisclosed,convertible_note,debt_financing,angel,grant,private_equity,post_ipo_equity,post_ipo_debt,secondary_market,product_crowdfunding,round_A,round_B,round_C,round_D,round_E,round_F,round_G,round_H
0,#waywire,http://www.waywire.com,|Entertainment|Politics|Social Media|News|,News,1750000,acquired,USA,NY,New York City,New York,1.0,2012-06-01,2012-06,2012-Q2,2012.0,2012-06-30,2012-06-30,1750000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,&TV Communications,http://enjoyandtv.com,|Games|,Games,4000000,operating,USA,CA,Los Angeles,Los Angeles,2.0,,,,,2010-06-04,2010-09-23,0.0,4000000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,'Rock' Your Paper,http://www.rockyourpaper.org,|Publishing|Education|,Publishing,40000,operating,EST,,Tallinn,Tallinn,1.0,2012-10-26,2012-10,2012-Q4,2012.0,2012-08-09,2012-08-09,40000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,(In)Touch Network,http://www.InTouchNetwork.com,|Electronics|Guides|Coffee|Restaurants|Music|i...,Electronics,1500000,operating,GBR,,London,London,1.0,2011-04-01,2011-04,2011-Q2,2011.0,2011-04-01,2011-04-01,1500000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,-R- Ranch and Mine,,|Tourism|Entertainment|Games|,Tourism,60000,operating,USA,TX,Dallas,Fort Worth,2.0,2014-01-01,2014-01,2014-Q1,2014.0,2014-08-17,2014-09-26,0.0,0.0,60000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [7]:
df['status'].value_counts()
# most of our start-ups are still operating. We will try to see which markets have the most start-ups which have been acquired, 
# vs the markets where the most start-ups have closed.

operating    41829
acquired      3692
closed        2603
Name: status, dtype: int64

In [8]:
df1 = df.drop(['name','homepage_url'], axis = 1)
# name and url can be dropped, they tell us nothing.

df1.iloc[4].isnull().sum() # checks how many NA values in an observation

0

In [9]:
df1['funding_total_usd'] = df1['funding_total_usd'].apply(lambda x: str(x).replace(',',''))
# Funding has been encoded with commas - here we remove them.
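Note that removing the commas still leaves the column as strings. If the funding totals are needed as numbers later, one option is `pd.to_numeric` with `errors="coerce"`. The sketch below uses a small made-up Series rather than the real column, and the `"-"` placeholder is an assumption about what a non-numeric entry might look like.

```python
import pandas as pd

# Toy stand-in for df1['funding_total_usd']; these values are invented for illustration.
funding = pd.Series(["1,750,000", "40000", "-", None])

# Strip the thousands separators, then coerce anything non-numeric
# (the "-" placeholder, missing values) to NaN.
cleaned = funding.str.replace(",", "", regex=False)
funding_num = pd.to_numeric(cleaned, errors="coerce")

print(funding_num.tolist())
```

Coercing rather than raising keeps the row count unchanged, at the cost of silently turning unparseable entries into NaN, so it is worth counting the NaN values afterwards.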

Let's take a look at the markets these start-ups operate in, and at how many start-ups in each market are operating, acquired, or closed.

In [28]:
dfa = df1[df1['status'] == 'acquired']     # subset 'acquired'
dfo = df1[df1['status'] == 'operating']    # subset 'operating'
dfc = df1[df1['status'] == 'closed']       # subset 'closed'

do = pd.DataFrame(data = dfo['market'].value_counts())  # dataframe of 'operating', value count of markets
do.columns = ['Operating']

da = pd.DataFrame(data = dfa['market'].value_counts())  # dataframe of 'acquired', value count of markets
da.columns = ['Acquired']

dc = pd.DataFrame(data = dfc['market'].value_counts())  # dataframe of 'closed', value count of markets
dc.columns = ['Closed']

dm = pd.DataFrame(data = df['market'].value_counts())  # dataframe of total markets
dm.columns = ['Total']

ds = pd.concat([do,da,dc,dm], axis = 1)

ds.head(15)

Market,Operating,Acquired,Closed,Total
Software,3809.0,464.0,254.0,4620
Biotechnology,3266.0,183.0,139.0,3688
Mobile,1602.0,198.0,144.0,1983
E-Commerce,1551.0,99.0,89.0,1805
Curated Web,1171.0,206.0,253.0,1655
Health Care,1067.0,67.0,51.0,1207
Clean Technology,1026.0,55.0,83.0,1200
Enterprise Software,1007.0,200.0,49.0,1280
Games,934.0,107.0,120.0,1182
Hardware + Software,928.0,75.0,62.0,1081
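The same market-by-status table can also be built in one step with `pd.crosstab`. This is an alternative sketch, not the method used above. The `toy` frame and its rows are invented stand-ins for `df1`.

```python
import pandas as pd

# Toy stand-in for df1; the rows are invented for illustration.
toy = pd.DataFrame({
    "market": ["Software", "Software", "Games", "Games", "Games", "Mobile"],
    "status": ["operating", "acquired", "operating", "closed", None, "operating"],
})

# Rows = markets, columns = statuses, cells = counts.
# Rows with a missing status are left out of the cross-tabulation.
counts = pd.crosstab(toy["market"], toy["status"])

# Total per market counts every row, including those with a missing status.
counts["Total"] = toy["market"].value_counts()

print(counts.sort_values("Total", ascending=False))
```

Unlike the concatenated value counts, the cross-tabulation fills empty market/status combinations with 0 instead of NaN, so the counts stay integers.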


Let's restrict ourselves to markets where we have more than 60 start-ups.

In [29]:
ds_short = ds[ds['Total'] > 60]

ds_short.head(15)

Market,Operating,Acquired,Closed,Total
Software,3809.0,464.0,254.0,4620
Biotechnology,3266.0,183.0,139.0,3688
Mobile,1602.0,198.0,144.0,1983
E-Commerce,1551.0,99.0,89.0,1805
Curated Web,1171.0,206.0,253.0,1655
Health Care,1067.0,67.0,51.0,1207
Clean Technology,1026.0,55.0,83.0,1200
Enterprise Software,1007.0,200.0,49.0,1280
Games,934.0,107.0,120.0,1182
Hardware + Software,928.0,75.0,62.0,1081


In [14]:
len(ds_short)

90

We are left with 90 markets, each of which has more than 60 start-ups. This is workable.

To be certain about our numbers, we'll check how many start-ups in each market have no recorded status, i.e. how far the Operating, Acquired, and Closed counts fall short of the Total.

In [30]:
oac_sum = pd.DataFrame(data = ds_short.drop(['Total'], axis = 1).sum(axis = 1))  # drop Total, then add the status counts across each row
oac_sum.columns = ['Added']  # Operating + Acquired + Closed

together = pd.concat([ds_short, oac_sum], axis = 1)

together['Missing'] = together['Total'] - together['Added']  # start-ups with no recorded status
together['Missing_Perc'] = round(together['Missing'] / together['Total'],2)  # missing as a fraction of Total (0.02 = 2%)

together.head(15)

Market,Operating,Acquired,Closed,Total,Added,Missing,Missing_Perc
Software,3809.0,464.0,254.0,4620,4527.0,93.0,0.02
Biotechnology,3266.0,183.0,139.0,3688,3588.0,100.0,0.03
Mobile,1602.0,198.0,144.0,1983,1944.0,39.0,0.02
E-Commerce,1551.0,99.0,89.0,1805,1739.0,66.0,0.04
Curated Web,1171.0,206.0,253.0,1655,1630.0,25.0,0.02
Health Care,1067.0,67.0,51.0,1207,1185.0,22.0,0.02
Clean Technology,1026.0,55.0,83.0,1200,1164.0,36.0,0.03
Enterprise Software,1007.0,200.0,49.0,1280,1256.0,24.0,0.02
Games,934.0,107.0,120.0,1182,1161.0,21.0,0.02
Hardware + Software,928.0,75.0,62.0,1081,1065.0,16.0,0.01


We see that for most markets, we are missing only a small percentage of data. This shouldn't cause us any problems.
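The gap between Total and Added corresponds to rows whose status is missing, so it can also be measured directly from the raw rows. A minimal sketch on an invented frame (the `toy` data and the `Missing_Frac` column name are illustrative, not from the notebook):

```python
import pandas as pd

# Toy stand-in for df1; the rows are invented for illustration.
toy = pd.DataFrame({
    "market": ["Software", "Software", "Software", "Games", "Games"],
    "status": ["operating", None, "acquired", "closed", None],
})

# For each market: how many rows lack a status, and what share of the market that is.
missing = toy["status"].isna().groupby(toy["market"]).agg(["sum", "mean"])
missing.columns = ["Missing", "Missing_Frac"]

print(missing)
```

Because the mean of a boolean column is the share of True values, a single group-by gives both the count and the fraction.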

Next, let's have a look at Operating/Acquired/Closed by percentage.

In [18]:
# new columns: each status as a fraction of the start-ups with a known status ('Added')

together['Operating_Perc'] = round(together['Operating'] / together['Added'],2)
together['Acquired_Perc'] = round(together['Acquired'] / together['Added'],2)
together['Closed_Perc'] = round(together['Closed'] / together['Added'],2)
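The three divisions above amount to normalizing each row of the status counts by its row sum. As a sketch of that idea on invented numbers (the `counts` frame and the market names are hypothetical), `DataFrame.div` does it for all columns at once:

```python
import pandas as pd

# Hypothetical status counts for two markets.
counts = pd.DataFrame(
    {"Operating": [90, 40], "Acquired": [6, 12], "Closed": [4, 8]},
    index=["Market A", "Market B"],
)

# Divide every row by its own sum so each row holds shares of that market.
shares = counts.div(counts.sum(axis=1), axis=0).round(2)

print(shares)
```

After rounding, a row of shares may not add up to exactly 1.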

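Below are the 20 markets with the highest share of start-ups still operating. Shares are rounded to two decimals, so markets tied at the same value (such as the run at 0.94) appear in no meaningful order.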

In [19]:
together['Operating_Perc'].sort_values(ascending = False).head(20)

Internet of Things     0.99
Transportation         0.97
Finance Technology     0.97
Real Estate            0.96
Digital Media          0.96
Medical                0.95
Manufacturing          0.95
Local Businesses       0.95
Retail                 0.94
Crowdsourcing          0.94
Pharmaceuticals        0.94
Fashion                0.94
Restaurants            0.94
Financial Services     0.94
Health and Wellness    0.94
Big Data               0.94
Entertainment          0.94
Education              0.94
Technology             0.94
Hospitality            0.94
Name: Operating_Perc, dtype: float64

Below is a list of the top 20 markets with the highest percentage of start-ups which have been acquired.

In [21]:
together['Acquired_Perc'].sort_values(ascending = False).head(20)

Web Hosting                0.19
Wireless                   0.19
Semiconductors             0.18
Security                   0.17
Video Streaming            0.17
iPhone                     0.17
Shopping                   0.16
Enterprise Software        0.16
Public Relations           0.14
Payments                   0.14
Video                      0.14
Curated Web                0.13
Publishing                 0.13
Messaging                  0.13
Facebook Applications      0.13
Advertising                0.13
Cloud Computing            0.12
Sales and Marketing        0.12
Location Based Services    0.12
Search                     0.12
Name: Acquired_Perc, dtype: float64

And finally, below is a list of the top 20 markets with the highest percentage of start-ups which have closed.

In [23]:
together['Closed_Perc'].sort_values(ascending = False).head(20)

Public Relations           0.20
Location Based Services    0.17
Facebook Applications      0.17
Curated Web                0.16
iPhone                     0.16
Social Network Media       0.14
Web Development            0.13
Music                      0.12
Messaging                  0.12
Networking                 0.12
Shopping                   0.12
Social Media               0.11
Internet Marketing         0.11
Android                    0.11
Games                      0.10
Search                     0.10
Video                      0.09
Advertising                0.08
Sales and Marketing        0.08
Sports                     0.08
Name: Closed_Perc, dtype: float64

So, from a start-up's point of view, some sectors look better than others to start a business in, both for staying in operation and for being acquired. Likewise, in some sectors the share of start-ups that have closed is noticeably higher than in others: Public Relations tops the closed list at 0.20, while in Internet of Things 0.99 of start-ups are still operating.
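Some of these markets have only a little over 60 start-ups, so their shares are less certain than those of the large markets. One rough way to gauge this, which the analysis above does not do, is a Wilson score interval around each share. The sketch below is illustrative only: the function name `wilson_interval` and the counts are hypothetical, not taken from the data.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for the proportion successes / n."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# Hypothetical counts: the same 20% closed share in a small and a large market.
small = wilson_interval(12, 60)
large = wilson_interval(400, 2000)

print(small)
print(large)
```

The interval for the small market is much wider, which suggests comparing markets by their intervals rather than by the rounded shares alone.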