# Analyzing Startup Fundraising Deals from Crunchbase

In this guided project, we'll assume we only have 10 megabytes of available memory. We will read the file **crunchbase-investments.csv** (10.3 MB) into SQLite in chunks and then analyze the data.

In [1]:
import pandas as pd
import numpy as np
import sqlite3
import chardet

In [2]:
preview = pd.read_csv('crunchbase-investments.csv', nrows=3)
preview

Unnamed: 0,company_permalink,company_name,company_category_code,company_country_code,company_state_code,company_region,company_city,investor_permalink,investor_name,investor_category_code,investor_country_code,investor_state_code,investor_region,investor_city,funding_round_type,funded_at,funded_month,funded_quarter,funded_year,raised_amount_usd
0,/company/advercar,AdverCar,advertising,USA,CA,SF Bay,San Francisco,/company/1-800-flowers-com,1-800-FLOWERS.COM,,USA,NY,New York,New York,series-a,2012-10-30,2012-10,2012-Q4,2012,2000000
1,/company/launchgram,LaunchGram,news,USA,CA,SF Bay,Mountain View,/company/10xelerator,10Xelerator,finance,USA,OH,Columbus,Columbus,other,2012-01-23,2012-01,2012-Q1,2012,20000
2,/company/utap,uTaP,messaging,USA,,United States - Other,,/company/10xelerator,10Xelerator,finance,USA,OH,Columbus,Columbus,other,2012-01-01,2012-01,2012-Q1,2012,20000


Let's check the encoding of the file using its first 100,000 bytes.

In [3]:
with open('crunchbase-investments.csv', "rb") as f:
    line = f.read(100000)
    print(chardet.detect(line)['encoding'])

Windows-1252


Windows-1252 and ISO-8859-1 agree on every byte except the 0x80-0x9F range, and ISO-8859-1 can decode any byte without raising an error, so we will pass `encoding='ISO-8859-1'` when we read the file. We will read the file in 5,000-row chunks to keep memory consumption under 10 MB. Along the way, we will check the number of missing values and the memory footprint of each column.
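The relationship between the two encodings can be checked with Python's built-in codecs. The following is a minimal sketch; the byte values are chosen for illustration and are not taken from the CSV file.

```python
# Bytes outside the 0x80-0x9F range, where the two encodings agree.
shared = bytes(range(0x00, 0x80)) + bytes(range(0xA0, 0x100))
same_outside = shared.decode('cp1252') == shared.decode('latin-1')

# Inside that range they differ: 0x80 is the euro sign in Windows-1252
# but a C1 control character in ISO-8859-1.
euro_cp1252 = b'\x80'.decode('cp1252')
euro_latin1 = b'\x80'.decode('latin-1')

# ISO-8859-1 maps all 256 byte values, so decoding never raises.
all_bytes_decoded = bytes(range(256)).decode('latin-1')

print(same_outside, repr(euro_cp1252), repr(euro_latin1), len(all_bytes_decoded))
```

The trade-off is that any byte in the 0x80-0x9F range would be read as a control character rather than the intended symbol, which is acceptable here only if such bytes are rare or absent in the columns we analyze.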

In [4]:
chunk_iter = pd.read_csv('crunchbase-investments.csv', 
                         encoding='ISO-8859-1',
                         chunksize=5000)
nrow = 0
missing_values = []
for chunk in chunk_iter:
    missing_values.append(chunk.isna().sum())
    nrow += len(chunk)
    
missing_value_all = pd.concat(missing_values)
missing_value_cv = missing_value_all.groupby(missing_value_all.index).sum()

In [5]:
print('''
Count of missing value in each column:

{}

Total number of row: {}
'''.format(missing_value_cv.sort_values(), nrow))


Count of missing value in each column:

company_country_code          1
company_name                  1
company_permalink             1
company_region                1
investor_region               2
investor_permalink            2
investor_name                 2
funded_quarter                3
funded_at                     3
funded_month                  3
funded_year                   3
funding_round_type            3
company_state_code          492
company_city                533
company_category_code       643
raised_amount_usd          3599
investor_country_code     12001
investor_city             12480
investor_state_code       16809
investor_category_code    50427
dtype: int64

Total number of row: 52870



`investor_category_code` has more than 50,000 missing values out of 52,870 rows. We will therefore drop this column, since it carries almost no information.
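To put the counts in perspective, we can convert them to missing rates. The sketch below copies a few of the counts from the output above into a Series; the variable names are our own.

```python
import pandas as pd

# Counts copied from the output above (a subset of the columns).
missing_counts = pd.Series({
    'raised_amount_usd': 3599,
    'investor_country_code': 12001,
    'investor_city': 12480,
    'investor_state_code': 16809,
    'investor_category_code': 50427,
})
n_rows = 52870  # total number of rows reported above

# Percentage of rows in which each column is missing.
missing_rate = (missing_counts / n_rows * 100).round(2)
print(missing_rate)
```

`investor_category_code` is missing in roughly 95% of the rows, far more than any other column.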

In [6]:
chunk_iter = pd.read_csv('crunchbase-investments.csv', 
                         encoding='ISO-8859-1',
                         chunksize=5000)

memory = []
for chunk in chunk_iter:
    memory.append(chunk.memory_usage(deep=True))
    
memory_combined = pd.concat(memory)
memory_total = memory_combined.groupby(memory_combined.index).sum()/(1024**2)

In [7]:
print('''
Memory footprint for each column(MB):

{}

Total: {}
'''.format(memory_total, memory_total.sum()))


Memory footprint for each column(MB):

Index                     0.001381
company_category_code     3.262619
company_city              3.343512
company_country_code      3.025223
company_name              3.424955
company_permalink         3.869808
company_region            3.253541
company_state_code        2.962161
funded_at                 3.378091
funded_month              3.226837
funded_quarter            3.226837
funded_year               0.403366
funding_round_type        3.252704
investor_category_code    0.593590
investor_city             2.751430
investor_country_code     2.524654
investor_name             3.734270
investor_permalink        4.749821
investor_region           3.238946
investor_state_code       2.361876
raised_amount_usd         0.403366
dtype: float64

Total: 56.98898792266846
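
The total above is summed over all chunks, so it is the per-chunk footprint that matters for the 10 MB budget. A rough estimate, assuming memory is spread evenly across rows (an assumption, since string lengths vary):

```python
import math

total_mb = 56.98898792266846  # total reported above, summed over all chunks
n_rows = 52870                # row count reported earlier
chunk_size = 5000

n_chunks = math.ceil(n_rows / chunk_size)

# Assumption: every row uses about the same amount of memory.
mb_per_full_chunk = total_mb / n_rows * chunk_size

print(n_chunks, round(mb_per_full_chunk, 2))
```

Under that assumption a full 5,000-row chunk takes between 5 and 6 MB, which stays within the 10 MB limit.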



We will also drop `company_permalink` and `investor_permalink` since they are URL paths.

In [8]:
useful_col = preview.columns.drop(['company_permalink', 'investor_permalink', 'investor_category_code']).to_list()

In [9]:
useful_col

['company_name',
 'company_category_code',
 'company_country_code',
 'company_state_code',
 'company_region',
 'company_city',
 'investor_name',
 'investor_country_code',
 'investor_state_code',
 'investor_region',
 'investor_city',
 'funding_round_type',
 'funded_at',
 'funded_month',
 'funded_quarter',
 'funded_year',
 'raised_amount_usd']

## Column types

Next, we will investigate the column types and determine whether any more space-efficient types are available.

In [10]:
chunk_iter = pd.read_csv('crunchbase-investments.csv', 
                         encoding='ISO-8859-1',
                         usecols=useful_col,
                         chunksize=5000)

col_type = {}
for chunk in chunk_iter:
    for col in chunk.columns:
        if col not in col_type:
            col_type[col] = [str(chunk.dtypes[col])]
        else:
            col_type[col].extend([str(chunk.dtypes[col])])
            
for col in col_type:    
    unique_type = set(col_type[col])
    col_type[col] = unique_type

In [11]:
col_type

{'company_name': {'object'},
 'company_category_code': {'object'},
 'company_country_code': {'object'},
 'company_state_code': {'object'},
 'company_region': {'object'},
 'company_city': {'object'},
 'investor_name': {'object'},
 'investor_country_code': {'float64', 'object'},
 'investor_state_code': {'float64', 'object'},
 'investor_region': {'object'},
 'investor_city': {'float64', 'object'},
 'funding_round_type': {'object'},
 'funded_at': {'object'},
 'funded_month': {'object'},
 'funded_quarter': {'object'},
 'funded_year': {'float64', 'int64'},
 'raised_amount_usd': {'float64'}}

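`investor_country_code`, `investor_state_code` and `investor_city` appear as both *float64* and *object*, while `funded_year` appears as both *float64* and *int64*. This most likely happens because pandas infers types chunk by chunk: a chunk in which a text column is entirely missing is read as *float64*, and an integer column with a missing value is promoted to *float64*. From our preview table, we can see that `investor_country_code`, `investor_state_code` and `investor_city` should be *object*, and `funded_year` can be kept as *float64* because of its missing values.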
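The effect of chunk-wise type inference can be reproduced on a toy CSV. This is a sketch with made-up data (the `name` and `city` columns are hypothetical), not the Crunchbase file.

```python
import io
import pandas as pd

# Toy CSV: `city` is empty in the first two rows.
csv_text = "name,city\nA,\nB,\nC,Boston\nD,Austin\n"

# Without a dtype hint, each chunk is inferred independently.
inferred = [str(chunk['city'].dtype)
            for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2)]

# With an explicit dtype, every chunk uses the same type.
forced = [str(chunk['city'].dtype)
          for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2,
                                   dtype={'city': 'object'})]

print(inferred)
print(forced)
```

The first chunk contains only missing values for `city`, so it is inferred as *float64*, while passing `dtype` keeps the column type consistent across chunks.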

In [12]:
col_type = {
    'investor_country_code':'object',
    'investor_state_code':'object',
    'investor_city':'object'
}

## Connect to SQLite

In [13]:
conn = sqlite3.connect('crunchbase.db')

chunk_iter = pd.read_csv('crunchbase-investments.csv', 
                         encoding='ISO-8859-1',
                         usecols=useful_col,
                         dtype=col_type,
                         chunksize=5000)

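# Note: if_exists='append' adds rows on every run, so run this cell only once
# (or drop the table first); re-running it loads duplicate rows.
for chunk in chunk_iter:
    chunk.to_sql('investments', conn, if_exists='append', index=False)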
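
Because `if_exists='append'` adds rows each time it runs, a chunked load is not safe to repeat. One possible way to make it repeatable, sketched here on hypothetical toy data with an in-memory database and a made-up `load` helper, is to replace the table on the first chunk and append afterwards.

```python
import sqlite3
import pandas as pd

# Toy data and table name are hypothetical, used only for illustration.
df = pd.DataFrame({'company_name': ['A', 'B', 'C'],
                   'raised_amount_usd': [1.0, 2.0, 3.0]})
conn = sqlite3.connect(':memory:')

def load(frame, chunk_rows=2):
    """Load `frame` in chunks, replacing the table on the first chunk."""
    for start in range(0, len(frame), chunk_rows):
        mode = 'replace' if start == 0 else 'append'
        frame.iloc[start:start + chunk_rows].to_sql(
            'deals', conn, if_exists=mode, index=False)

load(df)
load(df)  # simulate re-running the cell
n_rows = conn.execute('SELECT COUNT(*) FROM deals').fetchone()[0]
print(n_rows)
```

The row count stays equal to the number of rows in the source data no matter how many times the load is repeated.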

In [14]:
# return SQL query as pandas dataframe
def run_query(q):
    with sqlite3.connect('crunchbase.db') as conn:
        return pd.read_sql(q, conn)

In [15]:
# execute SQL command using sqlite module
def run_command(query):
    with sqlite3.connect('crunchbase.db') as conn:
        # autocommit
        conn.isolation_level = None
        conn.execute(query)

In [16]:
# return the name and type of each table and view in crunchbase.db
def show_tables():
    query = """
    SELECT
        name,
        type
    FROM sqlite_master
    WHERE type IN ("table","view");
    """   
    return run_query(query)

In [17]:
test_query = 'SELECT * FROM investments LIMIT 3;'

In [18]:
run_query(test_query)

Unnamed: 0,company_name,company_category_code,company_country_code,company_state_code,company_region,company_city,investor_name,investor_country_code,investor_state_code,investor_region,investor_city,funding_round_type,funded_at,funded_month,funded_quarter,funded_year,raised_amount_usd
0,AdverCar,advertising,USA,CA,SF Bay,San Francisco,1-800-FLOWERS.COM,USA,NY,New York,New York,series-a,2012-10-30,2012-10,2012-Q4,2012,2000000.0
1,LaunchGram,news,USA,CA,SF Bay,Mountain View,10Xelerator,USA,OH,Columbus,Columbus,other,2012-01-23,2012-01,2012-Q1,2012,20000.0
2,uTaP,messaging,USA,,United States - Other,,10Xelerator,USA,OH,Columbus,Columbus,other,2012-01-01,2012-01,2012-Q1,2012,20000.0


In [19]:
type_query = 'PRAGMA table_info(investments);'

In [20]:
run_query(type_query)

Unnamed: 0,cid,name,type,notnull,dflt_value,pk
0,0,company_name,TEXT,0,,0
1,1,company_category_code,TEXT,0,,0
2,2,company_country_code,TEXT,0,,0
3,3,company_state_code,TEXT,0,,0
4,4,company_region,TEXT,0,,0
5,5,company_city,TEXT,0,,0
6,6,investor_name,TEXT,0,,0
7,7,investor_country_code,TEXT,0,,0
8,8,investor_state_code,TEXT,0,,0
9,9,investor_region,TEXT,0,,0


## EDA 

Now that our data is in SQLite, the next step is to answer the following questions:

1. What proportion of the total amount of funds did the top 10% raise? What about the top 1%? Compare these values to the proportions the bottom 10% and bottom 1% raised.
2. Which category of company attracted the most investments?
3. Which investor contributed the most money (across all startups)?
4. Which investors contributed the most money per startup?
5. Which funding round was the most popular? Which was the least popular?

### Top 10% raise 51.44% of the total amount of funds

In [21]:
top10perc = '''
WITH 
raised_perc AS(
    SELECT 
        *, 
        raised_amount_usd/(
            SELECT SUM(raised_amount_usd) 
            FROM investments)*100 AS perc
    FROM investments
    ORDER BY perc DESC
    LIMIT CAST(
        (SELECT COUNT(*)
        FROM investments)*0.1 AS INTEGER)
    )

SELECT sum(perc) AS top10_perc_raise_proportion_perc
FROM raised_perc
'''

In [22]:
run_query(top10perc)

Unnamed: 0,top10_perc_raise_proportion_perc
0,51.435701


In [23]:
bottom10perc = '''
WITH 
raised_perc AS(
    SELECT 
        *, 
        raised_amount_usd/(
            SELECT SUM(raised_amount_usd) 
            FROM investments)*100 AS perc
    FROM investments
    ORDER BY perc 
    LIMIT CAST(
        (SELECT COUNT(*)
        FROM investments)*0.1 AS INTEGER)
    )

SELECT sum(perc) AS bottom10_perc_raise_proportion_perc
FROM raised_perc
'''

In [24]:
run_query(bottom10perc)

Unnamed: 0,bottom10_perc_raise_proportion_perc
0,0.022217


In [25]:
top1perc = '''
WITH 
raised_perc AS(
    SELECT 
        *, 
        raised_amount_usd/(
            SELECT SUM(raised_amount_usd) 
            FROM investments)*100 AS perc
    FROM investments
    ORDER BY perc DESC
    LIMIT CAST(
        (SELECT COUNT(*)
        FROM investments)*0.01 AS INTEGER)
    )

SELECT sum(perc) AS top1_perc_raise_proportion_perc
FROM raised_perc
'''

In [26]:
run_query(top1perc)

Unnamed: 0,top1_perc_raise_proportion_perc
0,19.833355


In [27]:
bottom1perc = '''
WITH 
raised_perc AS(
    SELECT 
        *, 
        raised_amount_usd/(
            SELECT SUM(raised_amount_usd) 
            FROM investments)*100 AS perc
    FROM investments
    ORDER BY perc 
    LIMIT CAST(
        (SELECT COUNT(*)
        FROM investments)*0.01 AS INTEGER)
    )

SELECT sum(perc) AS bottom1_perc_raise_proportion_perc
FROM raised_perc
'''

In [28]:
run_query(bottom1perc)

Unnamed: 0,bottom1_perc_raise_proportion_perc
0,


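The top 10% of investments raise around 51.44% of the total amount, while the top 1% raise around 19.83%. Meanwhile, the bottom 10% raise only 0.02%. The bottom 1% query returns no value: SQLite sorts NULLs first in ascending order, so that slice contains only rows with a missing `raised_amount_usd`. The bottom 10% figure includes such rows as well.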
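
The empty result for the bottom 1% is consistent with how SQLite orders missing values: in ascending order, NULLs come before every other value. The sketch below shows this on a hypothetical toy table (`deals` and its values are made up), along with one possible fix, which is to filter out NULLs before taking the bottom slice.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE deals (amount REAL)')
conn.executemany('INSERT INTO deals VALUES (?)',
                 [(None,), (None,), (10.0,), (20.0,), (70.0,)])

# The two "smallest" rows are the NULLs, so their sum is NULL (None).
with_nulls = conn.execute('''
    SELECT SUM(amount)
    FROM (SELECT amount FROM deals ORDER BY amount LIMIT 2)
''').fetchone()[0]

# Excluding NULLs first gives the two smallest known amounts.
without_nulls = conn.execute('''
    SELECT SUM(amount)
    FROM (SELECT amount FROM deals
          WHERE amount IS NOT NULL
          ORDER BY amount LIMIT 2)
''').fetchone()[0]

print(with_nulls, without_nulls)
```

Whether rows with an unknown amount should count toward the "bottom" at all is a judgment call; filtering them out answers the question for deals whose size is known.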

### Investments by category

In [29]:
cat_query = '''
SELECT 
    company_category_code AS Category,
    ROUND(sum(raised_amount_usd)/1000000000,2) AS raised_amount_usd_billion,
    count(*) AS Count
FROM investments
GROUP BY company_category_code
ORDER BY 2 DESC
LIMIT 10;
'''

In [30]:
run_query(cat_query)

Unnamed: 0,Category,raised_amount_usd_billion,Count
0,biotech,662.38,29706
1,software,438.51,43458
2,mobile,388.66,24402
3,cleantech,316.23,11688
4,enterprise,275.17,26934
5,web,240.86,30090
6,medical,152.2,7890
7,advertising,150.46,19200
8,ecommerce,135.4,13008
9,network_hosting,134.52,6450


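Companies in the **biotech** category raised the largest amount, 662.38 billion USD, while the **software** category has the largest number of investments (43,458). The counts in this top-10 list alone add up to far more than the 52,870 rows in the CSV, so the table evidently holds several copies of the data; absolute totals and counts are inflated accordingly, while rankings and proportions are unaffected.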

### Contribution of Investors

In [31]:
investor_query = '''
SELECT 
        investor_name,
        investor_country_code,
        investor_state_code,
        investor_region,
        investor_city, 
        SUM(raised_amount_usd)/1000000000 AS con_USD_billion,
        COUNT(*) AS n_investment,
        AVG(raised_amount_usd)/1000000 AS avg_con_per_invest_M
    FROM investments
    GROUP BY investor_name
    ORDER BY con_USD_billion DESC
    LIMIT 10;
'''

In [32]:
run_query(investor_query)

Unnamed: 0,investor_name,investor_country_code,investor_state_code,investor_region,investor_city,con_USD_billion,n_investment,avg_con_per_invest_M
0,Kleiner Perkins Caufield & Byers,USA,CA,SF Bay,Menlo Park,67.306958,2358,29.914204
1,New Enterprise Associates,USA,CA,SF Bay,Menlo Park,58.155254,2670,22.129092
2,Accel Partners,USA,CA,SF Bay,Palo Alto,38.832757,1932,20.810695
3,Goldman Sachs,USA,NY,New York,New York,38.252754,738,55.925079
4,Sequoia Capital,USA,CA,SF Bay,Menlo Park,36.236414,2214,17.255435
5,Intel,USA,CA,SF Bay,Santa Clara,35.8152,108,397.946667
6,Google,USA,CA,SF Bay,Mountain View,34.8528,132,290.44
7,Time Warner,USA,NY,New York,New York,34.38,72,520.909091
8,Comcast,USA,PA,Philadelphia,Philadelphia,34.014,54,629.888889
9,Greylock Partners,USA,CA,SF Bay,Menlo Park,29.765898,1506,20.670762


The top 10 investors are all from the USA, and most of them (7 of 10) are located in CA. The top investor is **Kleiner Perkins Caufield & Byers**, with 2,358 investments totalling 67.31 billion USD.

### Average investment on each startup by investor

In [33]:
invest_query = '''
SELECT 
        investor_name,
        investor_country_code,
        investor_state_code,
        investor_region,
        investor_city, 
        AVG(raised_amount_usd)/1000000 AS avg_con_per_invest_M,
        MAX(raised_amount_usd)/1000000 AS max_con_M,
        SUM(raised_amount_usd)/1000000000 AS con_USD_billion,
        COUNT(*) AS n_investment
    FROM investments
    GROUP BY investor_name
    ORDER BY AVG(raised_amount_usd)/1000000 DESC
    LIMIT 10;
'''

In [34]:
run_query(invest_query)

Unnamed: 0,investor_name,investor_country_code,investor_state_code,investor_region,investor_city,avg_con_per_invest_M,max_con_M,con_USD_billion,n_investment
0,Marlin Equity Partners,USA,,United States - Other,,2600.0,2600.0,15.6,6
1,BrightHouse,USA,CA,Los Angeles,Santa Monica,2350.0,3200.0,28.2,12
2,GI Partners,USA,CA,SF Bay,Menlo Park,1050.0,1050.0,6.3,6
3,Sprint Nextel,,,unknown,,833.333333,1500.0,15.0,18
4,Siemens PLM Software,,,unknown,,750.0,750.0,4.5,6
5,Comcast,USA,PA,Philadelphia,Philadelphia,629.888889,3200.0,34.014,54
6,Eagle River Holdings,USA,WA,Seattle,Kirkland,614.25,1500.0,14.742,30
7,Time Warner,USA,NY,New York,New York,520.909091,3200.0,34.38,72
8,Laurel Crown Partners,,,unknown,,450.0,450.0,2.7,6
9,Intel,USA,CA,SF Bay,Santa Clara,397.946667,3200.0,35.8152,108


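**Marlin Equity Partners** has the highest average investment amount among all investors. Interestingly, all 6 of its recorded investments have the same amount, 2,600 million USD (the average equals the maximum); identical rows like these may be repeated records rather than separate deals.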
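
Note that `AVG(raised_amount_usd)` averages over investment rows, whereas the question asks about money per startup. One possible alternative, sketched here on a hypothetical toy table (the investors, companies and amounts are made up), divides each investor's total by the number of distinct companies it funded.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('''CREATE TABLE investments
                (investor_name TEXT, company_name TEXT, raised_amount_usd REAL)''')
rows = [
    ('Fund X', 'Alpha', 10.0),
    ('Fund X', 'Alpha', 30.0),  # second round in the same startup
    ('Fund X', 'Beta', 20.0),
    ('Fund Y', 'Gamma', 50.0),
]
conn.executemany('INSERT INTO investments VALUES (?, ?, ?)', rows)

query = '''
SELECT
    investor_name,
    AVG(raised_amount_usd) AS avg_per_row,
    SUM(raised_amount_usd) / COUNT(DISTINCT company_name) AS avg_per_startup
FROM investments
GROUP BY investor_name
ORDER BY avg_per_startup DESC;
'''
result = conn.execute(query).fetchall()
print(result)
```

For an investor that funds the same startup in several rounds, the two averages differ, so the choice between them depends on how "per startup" is interpreted.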

### Popularity of funding round

In [35]:
funding_round_query = '''
SELECT 
        funding_round_type,
        AVG(raised_amount_usd)/1000000 AS avg_raised_amount_USD_M,
        MAX(raised_amount_usd)/1000000 AS max_raised_amount_USD_M,
        SUM(raised_amount_usd)/1000000000 AS total_USD_billion,
        COUNT(*) AS count
    FROM investments
    GROUP BY funding_round_type
    ORDER BY AVG(raised_amount_usd)/1000000 DESC;
'''

In [36]:
run_query(funding_round_query)

Unnamed: 0,funding_round_type,avg_raised_amount_USD_M,max_raised_amount_USD_M,total_USD_billion,count
0,post-ipo,1066.124138,3200.0,185.5056,198
1,private-equity,51.794474,2600.0,96.959255,2142
2,series-c+,24.689099,950.0,1594.520785,65220
3,other,19.815051,750.0,111.043548,5784
4,venture,16.256568,1500.0,783.338979,53502
5,series-b,14.869847,300.0,769.960657,52764
6,series-a,6.469474,319.0,519.252905,83628
7,crowdfunding,1.622875,2.145,0.038949,30
8,angel,0.690136,1.475,29.77245,53934
9,,,,,18


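**Post-ipo** rounds raised the most on average, about 1,066 million USD, roughly 20 times the average of the second type, **private-equity** (about 52 million USD). By number of investments, **series-a** is the most popular round type (83,628 rows) and **crowdfunding** the least popular (30 rows), ignoring the 18 rows with no round type.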
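
The query above is ordered by average amount, while popularity is a question of counts. A minimal sketch of a count-based ranking, using a hypothetical toy table rather than the real data:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE investments (funding_round_type TEXT)')
rounds = ['series-a'] * 3 + ['angel'] * 2 + ['crowdfunding'] + [None]
conn.executemany('INSERT INTO investments VALUES (?)', [(r,) for r in rounds])

# Rank round types by number of rows, skipping rows with no round type.
query = '''
SELECT funding_round_type, COUNT(*) AS n
FROM investments
WHERE funding_round_type IS NOT NULL
GROUP BY funding_round_type
ORDER BY n DESC;
'''
ranking = conn.execute(query).fetchall()
print(ranking[0], ranking[-1])
```

The first and last rows of such a ranking give the most and least popular round types directly.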