---
#### Analyzing Startup Fundraising Deals from Crunchbase
---

__Project Objectives:__ 

- To process the data in batches, on the assumption there is only 10MB available.
- To identify the percentage of missing values across each column.
- To identify the memory demand of the original source data.
- To remove columns with high % missing / memory demands.
- To optimise the dtypes of the data by analysing where they change across iterations over the data.
- To calculate the reduced memory demand of the optimised data.
- To create a database with the final data using SQLite.


__Original Data Source:__ 

https://github.com/datahoarder/crunchbase-october-2013/blob/master/crunchbase-investments.csv

In [1]:
import pandas as pd, sqlite3 as sql

In [2]:
# Read data in 5000 row chunks
df_iter = pd.read_csv('crunchbase-investments.csv', chunksize = 5000, encoding='ISO-8859-1')

In [3]:
# Identify no. of missing values for each col
missing_list = []
length = 0
for c in df_iter:
    length += len(c)
    missing_list.append(c.isnull().sum())
    
missing_combined = pd.concat(missing_list)
missing_consolodated = missing_combined.groupby(missing_combined.index).sum().sort_values(ascending = False)

In [4]:
print('Percentage of Missing Values Per Columns')
(100 * missing_consolodated / length).round(1)

Percentage of Missing Values Per Columns


investor_category_code    95.4
investor_state_code       31.8
investor_city             23.6
investor_country_code     22.7
raised_amount_usd          6.8
company_category_code      1.2
company_city               1.0
company_state_code         0.9
funding_round_type         0.0
funded_year                0.0
funded_month               0.0
funded_at                  0.0
funded_quarter             0.0
investor_name              0.0
investor_permalink         0.0
investor_region            0.0
company_region             0.0
company_permalink          0.0
company_name               0.0
company_country_code       0.0
dtype: float64

In [5]:
# Memory footprint of each col
df_iter = pd.read_csv('crunchbase-investments.csv', chunksize = 5000, encoding='ISO-8859-1')
counter = 0
memory_fp = pd.Series(dtype = 'float64')

for c in df_iter:
    if counter == 0:
        memory_fp = c.memory_usage(deep = True)
    else:
        memory_fp += c.memory_usage(deep = True)
    counter += 1
    
memory_fp = (memory_fp / (1024 **2)).sort_values(ascending = False)
print('Memory Footprint (MB)')
print('Total memory:', memory_fp.sum().round(2))
memory_fp.round(2)

Memory Footprint (MB)
Total memory: 56.99


investor_permalink        4.75
company_permalink         3.87
investor_name             3.73
company_name              3.42
funded_at                 3.38
company_city              3.34
company_category_code     3.26
company_region            3.25
funding_round_type        3.25
investor_region           3.24
funded_quarter            3.23
funded_month              3.23
company_country_code      3.03
company_state_code        2.96
investor_city             2.75
investor_country_code     2.52
investor_state_code       2.36
investor_category_code    0.59
raised_amount_usd         0.40
funded_year               0.40
Index                     0.00
dtype: float64

Given the memory demand and the lack of usefulness of the permalinks columns, and the number of missing values in investor_category_code (95.4%), we will drop these columns.

The following columns also have 20 - 30% of their values missing, but we will retain them for the meantime. 
- investor_state_code
- investor_city
- investor_country_code     

In [6]:
memory_drop = memory_fp[['investor_category_code','investor_permalink','company_permalink']].sum()
potential_drop_memory = memory_fp[['investor_state_code', 'investor_city', 'investor_country_code']].sum()

print('Total memory:', memory_fp.sum().round(2))
print('Memory reduction:', memory_drop.round(2))
print('Potential additional memory reduction:', potential_drop_memory.round(2))

Total memory: 56.99
Memory reduction: 9.21
Potential additional memory reduction: 7.64


In [7]:
# Drop cols
cols_drop = ['investor_category_code','investor_permalink','company_permalink']
cols_keep = c.columns.drop(cols_drop)
print('Retained Columns:')
cols_keep.to_list()

Retained Columns:


['company_name',
 'company_category_code',
 'company_country_code',
 'company_state_code',
 'company_region',
 'company_city',
 'investor_name',
 'investor_country_code',
 'investor_state_code',
 'investor_region',
 'investor_city',
 'funding_round_type',
 'funded_at',
 'funded_month',
 'funded_quarter',
 'funded_year',
 'raised_amount_usd']

In [8]:
# Column types
df_iter = pd.read_csv('crunchbase-investments.csv', chunksize = 5000, encoding='ISO-8859-1', usecols = cols_keep)

col_types = {}
for c in df_iter:
    for col in c.columns:
        if col not in col_types:
            col_types[col] = [str(c.dtypes[col])]
        else:
            col_types[col].append(str(c.dtypes[col]))
            
col_types_unique = {}
for k, v in col_types.items():
    col_types_unique[k] = set(col_types[k])

print('Column Types Across Iterations')
col_types_unique

Column Types Across Iterations


{'company_name': {'object'},
 'company_category_code': {'object'},
 'company_country_code': {'object'},
 'company_state_code': {'object'},
 'company_region': {'object'},
 'company_city': {'object'},
 'investor_name': {'object'},
 'investor_country_code': {'float64', 'object'},
 'investor_state_code': {'float64', 'object'},
 'investor_region': {'object'},
 'investor_city': {'float64', 'object'},
 'funding_round_type': {'object'},
 'funded_at': {'object'},
 'funded_month': {'object'},
 'funded_quarter': {'object'},
 'funded_year': {'float64', 'int64'},
 'raised_amount_usd': {'float64'}}

In [9]:
# Check the table
c

Unnamed: 0,company_name,company_category_code,company_country_code,company_state_code,company_region,company_city,investor_name,investor_country_code,investor_state_code,investor_region,investor_city,funding_round_type,funded_at,funded_month,funded_quarter,funded_year,raised_amount_usd
50000,NuORDER,fashion,USA,CA,Los Angeles,West Hollywood,Mortimer Singer,,,unknown,,series-a,2012-10-01,2012-10,2012-Q4,2012,3060000.0
50001,ChaCha,advertising,USA,IN,Indianapolis,Carmel,Morton Meyerson,,,unknown,,series-b,2007-10-01,2007-10,2007-Q4,2007,12000000.0
50002,Binfire,software,USA,FL,Bocat Raton,Bocat Raton,Moshe Ariel,,,unknown,,angel,2008-04-18,2008-04,2008-Q2,2008,500000.0
50003,Binfire,software,USA,FL,Bocat Raton,Bocat Raton,Moshe Ariel,,,unknown,,angel,2010-01-01,2010-01,2010-Q1,2010,750000.0
50004,Unified Color,software,USA,CA,SF Bay,South San Frnacisco,Mr. Andrew Oung,,,unknown,,angel,2010-01-01,2010-01,2010-Q1,2010,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
52865,Garantia Data,enterprise,USA,CA,SF Bay,Santa Clara,Zohar Gilon,,,unknown,,series-a,2012-08-08,2012-08,2012-Q3,2012,3800000.0
52866,DudaMobile,mobile,USA,CA,SF Bay,Palo Alto,Zohar Gilon,,,unknown,,series-c+,2013-04-08,2013-04,2013-Q2,2013,10300000.0
52867,SiteBrains,software,USA,CA,SF Bay,San Francisco,zohar israel,,,unknown,,angel,2010-08-01,2010-08,2010-Q3,2010,350000.0
52868,Comprehend Systems,enterprise,USA,CA,SF Bay,Palo Alto,Zorba Lieberman,,,unknown,,series-a,2013-07-11,2013-07,2013-Q3,2013,8400000.0


In [10]:
# Final DataFrame
col_types = {
    "company_name": "category", "company_category_code": "category", "company_country_code": "category",
    "company_state_code": "category", "company_city": "category", "investor_name": "category",
    "investor_country_code": "category", "investor_state_code": "category",
    "investor_city": "category", "funding_round_type": "category", "raised_amount_usd": "float64"
}
use_cols = [
    "company_name", "company_category_code", "company_country_code", "company_state_code","company_city",
    "investor_name", "investor_country_code", "investor_state_code", "investor_city",
    "funding_round_type", "funded_at", "raised_amount_usd"
]

In [11]:
# Memory footprint of finasl DF
df_iter_final = pd.read_csv('crunchbase-investments.csv', usecols = use_cols, 
                            dtype = col_types, parse_dates = ['funded_at'], 
                            encoding = 'ISO-8859-1', chunksize = 5000)

counter = 0
memory_final = pd.Series(dtype = 'float64')

for c in df_iter_final:
    if counter == 0:
        memory_final = c.memory_usage(deep = True)
    else:
        memory_final += c.memory_usage(deep = True)
    counter += 1
    
memory_final = (memory_final / (1024 **2)).sort_values(ascending = False)
print('Memory Footprint (MB)')
print('Total memory:', memory_final.sum().round(2))
memory_final.round(2)

Memory Footprint (MB)
Total memory: 6.34


company_name             2.99
investor_name            1.16
company_city             0.62
raised_amount_usd        0.40
funded_at                0.40
investor_city            0.30
company_category_code    0.09
company_state_code       0.09
investor_state_code      0.08
investor_country_code    0.08
funding_round_type       0.06
company_country_code     0.05
Index                    0.00
dtype: float64

In [12]:
memory_reduction = {}

for row in memory_fp.index:
    if row not in memory_final.index:
        reduction = memory_fp[row]
    else:
        reduction = memory_fp[row] - memory_final[row]
    memory_reduction.update({row: reduction.round(2)})

print('Final Memory Reduction')
print('_______________________')
print('Total:', sum(memory_reduction.values()).round(2))
print('Percentage:', 
      round(100 * sum(memory_reduction.values()) / memory_fp.sum(), 2),
     '%')
pd.Series(memory_reduction)

Final Memory Reduction
_______________________
Total: 50.64
Percentage: 88.86 %


investor_permalink        4.75
company_permalink         3.87
investor_name             2.57
company_name              0.44
funded_at                 2.97
company_city              2.72
company_category_code     3.17
company_region            3.25
funding_round_type        3.19
investor_region           3.24
funded_quarter            3.23
funded_month              3.23
company_country_code      2.97
company_state_code        2.87
investor_city             2.45
investor_country_code     2.45
investor_state_code       2.28
investor_category_code    0.59
raised_amount_usd         0.00
funded_year               0.40
Index                     0.00
dtype: float64

In [13]:
# Connect to SQLite3 and create new database
conn = sql.connect('crunchbase.db')

for c in df_iter_final:
    c.to_sql('investments', conn, if_exists = 'append', index = False)

In [14]:
# Test the database
pd.read_sql('''SELECT * FROM investments;''', conn)

Unnamed: 0,company_name,company_category_code,company_country_code,company_state_code,company_region,company_city,investor_name,investor_country_code,investor_state_code,investor_region,investor_city,funding_round_type,funded_at,funded_month,funded_quarter,funded_year,raised_amount_usd
0,AdverCar,advertising,USA,CA,SF Bay,San Francisco,1-800-FLOWERS.COM,USA,NY,New York,New York,series-a,2012-10-30,2012-10,2012-Q4,2012.0,2000000.0
1,LaunchGram,news,USA,CA,SF Bay,Mountain View,10Xelerator,USA,OH,Columbus,Columbus,other,2012-01-23,2012-01,2012-Q1,2012.0,20000.0
2,uTaP,messaging,USA,,United States - Other,,10Xelerator,USA,OH,Columbus,Columbus,other,2012-01-01,2012-01,2012-Q1,2012.0,20000.0
3,ZoopShop,software,USA,OH,Columbus,columbus,10Xelerator,USA,OH,Columbus,Columbus,angel,2012-02-15,2012-02,2012-Q1,2012.0,20000.0
4,eFuneral,web,USA,OH,Cleveland,Cleveland,10Xelerator,USA,OH,Columbus,Columbus,other,2011-09-08,2011-09,2011-Q3,2011.0,20000.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
370085,Garantia Data,enterprise,USA,CA,,Santa Clara,Zohar Gilon,,,,,series-a,2012-08-08 00:00:00,,,,3800000.0
370086,DudaMobile,mobile,USA,CA,,Palo Alto,Zohar Gilon,,,,,series-c+,2013-04-08 00:00:00,,,,10300000.0
370087,SiteBrains,software,USA,CA,,San Francisco,zohar israel,,,,,angel,2010-08-01 00:00:00,,,,350000.0
370088,Comprehend Systems,enterprise,USA,CA,,Palo Alto,Zorba Lieberman,,,,,series-a,2013-07-11 00:00:00,,,,8400000.0


In [15]:
# Check the data types
pd.read_sql('''PRAGMA table_info(investments);''', conn)

Unnamed: 0,cid,name,type,notnull,dflt_value,pk
0,0,company_name,TEXT,0,,0
1,1,company_category_code,TEXT,0,,0
2,2,company_country_code,TEXT,0,,0
3,3,company_state_code,TEXT,0,,0
4,4,company_region,TEXT,0,,0
5,5,company_city,TEXT,0,,0
6,6,investor_name,TEXT,0,,0
7,7,investor_country_code,TEXT,0,,0
8,8,investor_state_code,TEXT,0,,0
9,9,investor_region,TEXT,0,,0


__SQL Database Analysis__

In [16]:
# What proportion of the total funds did the top 10% raise? 
total_invested = pd.read_sql('''SELECT SUM(raised_amount_usd) 
                                FROM investments;''', conn)
funding_total = (total_invested.values[0][0] / 1e9).astype(int)
print('Total Investment:', funding_total, 'billion dollars')

Total Investment: 4772 billion dollars


In [17]:
top_10_raised=pd.read_sql("""
SELECT iv.company_name,
    CAST(SUM(raised_amount_usd) AS DOUBLE)/(SELECT CAST(SUM(raised_amount_usd) AS BIGINT) from investments) as percentage_funding,
    CAST(SUM(raised_amount_usd) AS BIGINT) AS funding_amount
    FROM investments AS iv
    GROUP BY iv.company_name 
    ORDER BY funding_amount desc
    LIMIT (SELECT CAST(COUNT(distinct company_name) * 0.1 AS INT) FROM investments)
    """,conn)

funding_top10 = top_10_raised["funding_amount"].sum() / 1e9
pct_top10 = funding_top10 / funding_total
print("funding raised by top 10 percent %.2f billion dollars"%(funding_top10))
print("funding raised by top 10 percent %.2f percent "%(100 *pct_top10))

funding raised by top 10 percent 3203.42 billion dollars
funding raised by top 10 percent 67.13 percent 


In [18]:
# What about the top 1%? 
top_1_raised = pd.read_sql("""
SELECT iv.company_name,
    CAST(SUM(raised_amount_usd) AS DOUBLE)/(SELECT CAST(SUM(raised_amount_usd) AS BIGINT) from investments) as percentage_funding,
    CAST(SUM(raised_amount_usd) AS BIGINT) AS funding_amount
    FROM investments AS iv
    GROUP BY iv.company_name 
    ORDER BY funding_amount desc
    LIMIT (SELECT CAST(COUNT(distinct company_name) * 0.01 AS INT) FROM investments)
    """,conn)

funding_top1 = top_1_raised["funding_amount"].sum() / 1e9
pct_top1 = funding_top1 / funding_total
print("funding raised by top 10 percent %.2f billion dollars"%(funding_top1))
print("funding raised by top 10 percent %.2f percent "%(100 *pct_top1))

funding raised by top 10 percent 1251.14 billion dollars
funding raised by top 10 percent 26.22 percent 


In [19]:
# Compare these values to the proportions the bottom 10% and bottom 1% raised.

# Bottom 10%
bottom_10_raised = pd.read_sql("""
SELECT iv.company_name,
    CAST(SUM(raised_amount_usd) AS DOUBLE)/(SELECT CAST(SUM(raised_amount_usd) AS BIGINT) from investments) as percentage_funding,
    CAST(SUM(raised_amount_usd) AS BIGINT) AS funding_amount
    FROM investments AS iv
    GROUP BY iv.company_name
    HAVING funding_amount IS NOT NULL
    ORDER BY funding_amount asc
    LIMIT (SELECT CAST(COUNT(distinct company_name) * 0.1 AS INT) FROM investments)
    """,conn)

funding_bottom10 = bottom_10_raised["funding_amount"].sum() / 1e9
pct_bottom10 = funding_bottom10 / funding_total
print("funding raised by bottom 10 percent %.2f billion dollars"%(funding_bottom10))
print("funding raised by bottom 10 percent %.2f percent "%(100 *pct_bottom10))
print('\n')

# Bottom 1%
bottom_1_raised = pd.read_sql("""
SELECT iv.company_name,
    CAST(SUM(raised_amount_usd) AS DOUBLE)/(SELECT CAST(SUM(raised_amount_usd) AS BIGINT) from investments) as percentage_funding,
    CAST(SUM(raised_amount_usd) AS BIGINT) AS funding_amount
    FROM investments AS iv
    GROUP BY iv.company_name
    HAVING funding_amount IS NOT NULL
    ORDER BY funding_amount asc
    LIMIT (SELECT CAST(COUNT(distinct company_name) * 0.01 AS INT) FROM investments)
    """,conn)

funding_bottom1 = bottom_1_raised["funding_amount"].sum() / 1e9
pct_bottom1 = funding_bottom1 / funding_total
print("funding raised by bottom 1 percent %.2f billion dollars"%(funding_bottom1))
print("funding raised by bottom 1 percent %.2f percent "%(100 *pct_bottom1))

funding raised by bottom 10 percent 1.77 billion dollars
funding raised by bottom 10 percent 0.04 percent 


funding raised by bottom 1 percent 0.01 billion dollars
funding raised by bottom 1 percent 0.00 percent 


In [20]:
# Which category of company attracted the most investments?
range_no = 5

investment = pd.read_sql('''
SELECT iv.company_category_code, COUNT(*) as frequency
FROM investments iv
GROUP BY iv.company_category_code
ORDER BY frequency DESC
LIMIT {} ;'''.format(range_no), conn)

print('Number of Investors Per Category')
for i in range(range_no):
    print(
        "%s: %d"
        %(investment["company_category_code"][i],
          investment["frequency"][i]))

Number of Investors Per Category
software: 50701
web: 35105
biotech: 34657
enterprise: 31423
mobile: 28469


In [21]:
# Which investor contributed the most money (across all startups)?
no_top_funders = 10

top_funders = pd.read_sql('''
SELECT iv.investor_name, COUNT(*) as frequency, 
ROUND(CAST(SUM(raised_amount_usd) AS BIGINT) / 1e9, 2) AS billion_invested
FROM investments AS iv
GROUP BY iv.investor_name
ORDER BY frequency DESC
LIMIT {} ;'''.format(no_top_funders), conn)

top_funders.index = top_funders.reset_index().index + 1
top_funders['million_per_investment'] = round(1000 * top_funders['billion_invested'] / top_funders['frequency'], 2)
top_funders

Unnamed: 0,investor_name,frequency,billion_invested,million_per_investment
1,New Enterprise Associates,3115,67.85,21.78
2,SV Angel,3052,12.5,4.1
3,Kleiner Perkins Caufield & Byers,2751,78.52,28.54
4,Sequoia Capital,2583,42.28,16.37
5,Draper Fisher Jurvetson (DFJ),2520,31.51,12.5
6,Intel Capital,2317,32.87,14.19
7,First Round Capital,2282,13.41,5.88
8,Accel Partners,2254,45.3,20.1
9,Techstars,1869,0.49,0.26
10,500 Startups,1778,3.06,1.72


In [22]:
# Which investors contributed the most money per startup?
no_top_funders = 10

top_funders = pd.read_sql('''
SELECT iv.investor_name, COUNT(*) as frequency, 
ROUND(CAST(SUM(raised_amount_usd) AS BIGINT) / 1e9, 2) AS billion_invested
FROM investments AS iv
GROUP BY iv.investor_name
ORDER BY billion_invested / frequency DESC
LIMIT {} ;'''.format(no_top_funders), conn)

top_funders.index = top_funders.reset_index().index + 1
top_funders['million_per_investment'] = round(1000 * top_funders['billion_invested'] / top_funders['frequency'], 2)
top_funders

Unnamed: 0,investor_name,frequency,billion_invested,million_per_investment
1,Marlin Equity Partners,7,18.2,2600.0
2,BrightHouse,14,32.9,2350.0
3,GI Partners,7,7.35,1050.0
4,Sprint Nextel,21,17.5,833.33
5,Siemens PLM Software,7,5.25,750.0
6,Comcast,63,39.68,629.84
7,Eagle River Holdings,35,17.2,491.43
8,Time Warner,84,40.11,477.5
9,Laurel Crown Partners,7,3.15,450.0
10,Digital Sky Technologies,56,20.45,365.18


In [23]:
# Which funding round was the most popular? Which was the least popular?
no_rounds = 10

top_rounds = pd.read_sql('''
SELECT iv.funding_round_type AS funding_round, COUNT(*) as frequency
FROM investments AS iv
GROUP BY funding_round
ORDER BY frequency DESC
LIMIT {} ;'''.format(no_rounds), conn)

top_rounds.index = top_rounds.reset_index().index + 1
top_rounds

Unnamed: 0,funding_round,frequency
1,series-a,97566
2,series-c+,76090
3,angel,62923
4,venture,62419
5,series-b,61558
6,other,6748
7,private-equity,2499
8,post-ipo,231
9,crowdfunding,35
10,,21
