# Analyzing Startup Fundraising Deals from Crunchbase

## Introduction

Every year, thousands of startup companies raise financing from investors. Each time a startup raises money, we refer to the event as a fundraising round. Crunchbase is a website that crowdsources information on the fundraising rounds of many startups. The Crunchbase user community submits, edits, and maintains most of the information in Crunchbase.

In return, Crunchbase makes the data available through a Web application and a fee-based API. Before Crunchbase switched to the paid API model, multiple groups crawled the site and released the data online. Because the information on the startups and their fundraising rounds is always changing, the data set we'll be using isn't completely up to date.

The data set of investments we'll be exploring is current as of October 2013. You can download it [from GitHub](https://github.com/datahoarder/crunchbase-october-2013/blob/master/crunchbase-investments.csv).

Throughout this guided project, we'll practice working with different memory constraints. In this step, let's assume we only have 10 megabytes of available memory. While crunchbase-investments.csv consumes 10.3 megabytes of disk space, we know from earlier missions that pandas often requires 4 to 6 times as much space in memory as the file takes on disk (especially when there are many string columns).
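
As a quick sanity check of that rule of thumb, here is a minimal sketch of the arithmetic. The constant names are illustrative, and the 4x to 6x range is only the heuristic quoted above, not a measured value.

```python
# Rough in-memory estimate for the CSV, using the figures quoted above.
FILE_SIZE_MB = 10.3          # size of crunchbase-investments.csv on disk
MEMORY_BUDGET_MB = 10        # assumed memory limit for this step
EXPANSION_FACTORS = (4, 6)   # rule-of-thumb range for pandas overhead

low, high = (FILE_SIZE_MB * f for f in EXPANSION_FACTORS)
print(f"Estimated footprint: {low:.1f} to {high:.1f} MB")
print(f"Fits in budget: {high <= MEMORY_BUDGET_MB}")
```

Even the optimistic end of the estimate is several times the budget, which is why the file has to be read in chunks.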

## Importing packages

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sqlite3

In [2]:
pd.options.display.max_columns = 99
%matplotlib inline

## Defining chunksize for pandas

As we only have 10 MB of available memory, the chunksize parameter needs to be set so that each chunk fits comfortably in memory while we read the CSV file.

Let's start with a chunksize of 5000. The memory usage per chunk should be around 5 MB (~50% of the available memory).

In [3]:
chunk_iter = pd.read_csv("my_datasets/crunchbase-investments.csv", chunksize=5000, encoding='ISO-8859-1')

memory = [chunk.memory_usage(deep=True).sum()/2**20 for chunk in chunk_iter]
print("Memory Usage per chunk:", memory)
print("Min Memory Usage per chunk:", min(memory))
print("Max Memory Usage per chunk:", max(memory))
total_memory = sum(memory)
print("\nTotal Memory:",total_memory)

Memory Usage per chunk: [5.579195022583008, 5.528186798095703, 5.535004615783691, 5.528155326843262, 5.524299621582031, 5.553397178649902, 5.531391143798828, 5.509613037109375, 5.396082878112793, 4.63945198059082, 2.663668632507324]
Min Memory Usage per chunk: 2.663668632507324
Max Memory Usage per chunk: 5.579195022583008

Total Memory: 56.98844623565674


In [4]:
# Columns memory footprint
chunk_iter = pd.read_csv("my_datasets/crunchbase-investments.csv", chunksize=5000, encoding='ISO-8859-1')
col_footprint_comb = pd.concat([chunk.memory_usage(deep=True)/2**20 for chunk in chunk_iter])
col_footprint = col_footprint_comb.groupby(col_footprint_comb.index).sum()
print("Memory Usage per column:\n",col_footprint)

Memory Usage per column:
 Index                     0.000877
company_category_code     3.262619
company_city              3.343493
company_country_code      3.025223
company_name              3.424955
company_permalink         3.869808
company_region            3.253522
company_state_code        2.962161
funded_at                 3.378091
funded_month              3.226837
funded_quarter            3.226837
funded_year               0.403366
funding_round_type        3.252704
investor_category_code    0.593590
investor_city             2.751430
investor_country_code     2.524654
investor_name             3.734270
investor_permalink        4.749821
investor_region           3.238946
investor_state_code       2.361876
raised_amount_usd         0.403366
dtype: float64


The largest chunk uses about 5.6 MB, comfortably below the 10 MB limit, so a **chunksize of 5000** works for our purpose.
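
To see how much headroom this choice leaves, the following sketch derives a per-row cost from the largest chunk measured above. The variable names are illustrative, and the estimate assumes rows in other chunks cost roughly the same.

```python
# Figures taken from the chunk measurements above.
CHUNKSIZE = 5000
MAX_CHUNK_MB = 5.579195022583008   # largest chunk observed
MEMORY_BUDGET_MB = 10              # assumed memory limit

mb_per_row = MAX_CHUNK_MB / CHUNKSIZE
budget_share = MAX_CHUNK_MB / MEMORY_BUDGET_MB
max_rows_in_budget = int(MEMORY_BUDGET_MB / mb_per_row)

print(f"Approx. KB per row: {mb_per_row * 1024:.2f}")
print(f"Largest chunk uses {budget_share:.0%} of the budget")
print(f"Rows that would fill the whole budget: {max_rows_in_budget}")
```

Staying well below the theoretical maximum leaves room for intermediate objects created while each chunk is processed.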

## Exploring columns

Let's take a closer look at the different columns:

In [5]:
# Number of rows
chunk_iter = pd.read_csv("my_datasets/crunchbase-investments.csv", chunksize=5000, encoding='ISO-8859-1')
n_rows = sum([len(chunk) for chunk in chunk_iter])
print("Number of rows:",n_rows)

Number of rows: 52870


In [6]:
# Missing values
chunk_iter = pd.read_csv("my_datasets/crunchbase-investments.csv", chunksize=5000, encoding='ISO-8859-1')
missing_comb = pd.concat([chunk.isnull().sum() for chunk in chunk_iter])
missing_values = missing_comb.groupby(missing_comb.index).sum().sort_values()
missing_values

company_country_code          1
company_name                  1
company_permalink             1
company_region                1
investor_region               2
investor_permalink            2
investor_name                 2
funded_quarter                3
funded_at                     3
funded_month                  3
funded_year                   3
funding_round_type            3
company_state_code          492
company_city                533
company_category_code       643
raised_amount_usd          3599
investor_country_code     12001
investor_city             12480
investor_state_code       16809
investor_category_code    50427
dtype: int64

In [7]:
# Data types
chunk_iter = pd.read_csv("my_datasets/crunchbase-investments.csv", chunksize=5000, encoding='ISO-8859-1')
chunk_iter.get_chunk(20).dtypes

company_permalink          object
company_name               object
company_category_code      object
company_country_code       object
company_state_code         object
company_region             object
company_city               object
investor_permalink         object
investor_name              object
investor_category_code     object
investor_country_code      object
investor_state_code        object
investor_region            object
investor_city              object
funding_round_type         object
funded_at                  object
funded_month               object
funded_quarter             object
funded_year                 int64
raised_amount_usd         float64
dtype: object

In [8]:
# Exploring content
chunk_iter.get_chunk(5)

Unnamed: 0,company_permalink,company_name,company_category_code,company_country_code,company_state_code,company_region,company_city,investor_permalink,investor_name,investor_category_code,investor_country_code,investor_state_code,investor_region,investor_city,funding_round_type,funded_at,funded_month,funded_quarter,funded_year,raised_amount_usd
20,/company/ecotality,ECOtality,cleantech,USA,CA,SF Bay,San Francisco,/company/abb,ABB,cleantech,CHE,,Zurich,Zurich,venture,2011-01-09,2011-01,2011-Q1,2011,10000000
21,/company/evalve,Evalve,biotech,USA,CA,SF Bay,Menlo Park,/company/abbott,Abbott Labs,biotech,USA,IL,Chicago,Des Plaines,series-c+,2007-11-27,2007-11,2007-Q4,2007,60000000
22,/company/ovalis,Ovalis,biotech,USA,CA,SF Bay,Mountain View,/company/abbott,Abbott Labs,biotech,USA,IL,Chicago,Des Plaines,series-b,2007-01-30,2007-01,2007-Q1,2007,6600000
23,/company/alvine-pharmaceuticals,Alvine Pharmaceuticals,biotech,USA,CA,SF Bay,San Carlos,/company/abbvie,AbbVie,biotech,USA,IL,Chicago,Chicago,private-equity,2013-05-14,2013-05,2013-Q2,2013,70000000
24,/company/avaxia-biologics,Avaxia Biologics,biotech,USA,MA,Boston,Lexington,/company/abbvie,AbbVie,biotech,USA,IL,Chicago,Chicago,series-b,2013-06-07,2013-06,2013-Q2,2013,11500000


In [9]:
# Getting column names
col_names = chunk_iter.get_chunk(1).columns

In [10]:
# # Exploring unique values
# unique = {}
# for col in col_names:
#     unique[col] = []

# chunk_iter = pd.read_csv("my_datasets/crunchbase-investments.csv", chunksize=5000, encoding='ISO-8859-1')
# for chunk in chunk_iter:
#     for col in chunk.columns:
#         unique_vals = chunk[col].unique()
#         unique[col].extend([v for v in unique_vals if v not in unique[col] and v is not np.nan])

In [11]:
# Exploring unique values
# Collect the per-chunk value counts for every column
value_counts = {col: [] for col in col_names}

chunk_iter = pd.read_csv("my_datasets/crunchbase-investments.csv", chunksize=5000, encoding='ISO-8859-1')
for chunk in chunk_iter:
    for col in chunk.columns:
        value_counts[col].append(chunk[col].value_counts(dropna=False))

# Combine the per-chunk counts by summing the counts of identical values.
# Note: groupby() drops NaN keys by default, so missing values are excluded from the totals.
for col in col_names:
    comb = pd.concat(value_counts[col])
    value_counts[col] = comb.groupby(comb.index).sum().sort_values()
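
The combine-by-summing idea used in this cell does not depend on pandas. As an illustration only, here is the same pattern with `collections.Counter` from the standard library, applied to a few made-up chunks of a single column (the values are hypothetical, not taken from the data set).

```python
from collections import Counter

# Toy "chunks" standing in for slices of one column (hypothetical values).
chunks = [
    ["biotech", "software", "biotech"],
    ["mobile", "software", "biotech"],
    ["mobile", None, "web"],
]

totals = Counter()
for chunk in chunks:
    # Count within the chunk, then fold the partial counts into the total.
    totals.update(Counter(v for v in chunk if v is not None))

n_unique = len(totals)
n_valid = sum(totals.values())
print(totals.most_common())
print(f"{n_unique} unique values over {n_valid} non-missing rows")
```

Only the running totals are kept between chunks, so the memory needed grows with the number of distinct values rather than with the number of rows.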

In [12]:
print("Unique values:")
for key in value_counts:
    print("-", key, ":", len(value_counts[key]))

Unique values:
- company_permalink : 11573
- company_name : 11573
- company_category_code : 43
- company_country_code : 2
- company_state_code : 50
- company_region : 546
- company_city : 1229
- investor_permalink : 10552
- investor_name : 10465
- investor_category_code : 33
- investor_country_code : 72
- investor_state_code : 50
- investor_region : 585
- investor_city : 990
- funding_round_type : 9
- funded_at : 2808
- funded_month : 192
- funded_quarter : 72
- funded_year : 20
- raised_amount_usd : 1458


In [13]:
print("Percentage of unique values:")
for key in value_counts:
    valid_rows = value_counts[key].sum()
    print("-", key, ":", "{:.2f}%".format(len(value_counts[key])/valid_rows*100))

Percentage of unique values:
- company_permalink : 21.89%
- company_name : 21.89%
- company_category_code : 0.08%
- company_country_code : 0.00%
- company_state_code : 0.10%
- company_region : 1.03%
- company_city : 2.35%
- investor_permalink : 19.96%
- investor_name : 19.79%
- investor_category_code : 1.35%
- investor_country_code : 0.18%
- investor_state_code : 0.14%
- investor_region : 1.11%
- investor_city : 2.45%
- funding_round_type : 0.02%
- funded_at : 5.31%
- funded_month : 0.36%
- funded_quarter : 0.14%
- funded_year : 0.04%
- raised_amount_usd : 2.96%


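We will drop columns that aren't useful for the analysis:
- Links (the permalink columns)
- Columns with too many missing values (>50%)
- Redundant information (funded_month, funded_quarter, and funded_year can be derived from funded_at)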

In [14]:
drop_cols = ["investor_permalink","company_permalink","investor_category_code", "funded_month", "funded_quarter", "funded_year"]
keep_cols = chunk.columns.drop(drop_cols)
keep_cols

Index(['company_name', 'company_category_code', 'company_country_code',
       'company_state_code', 'company_region', 'company_city', 'investor_name',
       'investor_country_code', 'investor_state_code', 'investor_region',
       'investor_city', 'funding_round_type', 'funded_at',
       'raised_amount_usd'],
      dtype='object')
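
The ">50% missing" criterion can be checked against the counts reported earlier. The sketch below copies the four largest missing-value counts and the row total from the outputs above; the threshold constant and variable names are illustrative.

```python
N_ROWS = 52870  # total number of rows counted earlier

# Missing-value counts copied from the earlier output (largest four only).
missing = {
    "investor_category_code": 50427,
    "investor_state_code": 16809,
    "investor_city": 12480,
    "investor_country_code": 12001,
}

THRESHOLD = 0.5  # drop a column when more than half of its values are missing
share = {col: n / N_ROWS for col, n in missing.items()}
too_sparse = [col for col, s in share.items() if s > THRESHOLD]

for col, s in share.items():
    print(f"{col}: {s:.1%} missing")
print("Columns above threshold:", too_sparse)
```

Only investor_category_code crosses the threshold, which matches the drop list used in the cell above.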

## Selecting Data Types

Now that we have a good sense of the missing values, let's get familiar with the column types before adding the data into SQLite.

After taking a look at the data, we'll use the following types:
- company_name : TEXT
- company_category_code : TEXT -> Category
- company_country_code : TEXT -> Category
- company_state_code : TEXT -> Category
- company_region : TEXT -> Category
- company_city : TEXT -> Category
- investor_name : TEXT
- investor_country_code : TEXT -> Category
- investor_state_code : TEXT -> Category
- investor_region : TEXT -> Category
- investor_city : TEXT -> Category
- funding_round_type : TEXT -> Category
- funded_at : DATE
- raised_amount_usd : FLOAT

In [15]:
data_types = {
    "company_category_code": "category",
    "company_country_code": "category",
    "company_state_code": "category",
    "company_region": "category",
    "company_city": "category",
    "investor_country_code": "category",
    "investor_state_code": "category",
    "investor_region": "category",
    "investor_city": "category",
    "funding_round_type": "category"
}

chunk_iter = pd.read_csv("my_datasets/crunchbase-investments.csv", chunksize=5000, encoding='ISO-8859-1',
                         usecols=keep_cols, dtype=data_types, parse_dates=["funded_at"])

new_memory = []
for chunk in chunk_iter:
    new_memory.append(chunk.memory_usage(deep=True).sum()/2**20)
print("Memory Usage per chunk:", new_memory)
print("Min Memory Usage per chunk:", min(new_memory))
print("Max Memory Usage per chunk:", max(new_memory))
total_new_memory = sum(new_memory)
print("\nTotal Memory:",total_new_memory)

Memory Usage per chunk: [0.959895133972168, 0.9593954086303711, 0.9572210311889648, 0.9300775527954102, 0.9590654373168945, 0.9625625610351562, 0.9615554809570312, 0.9449682235717773, 0.9200410842895508, 0.8319911956787109, 0.4919147491455078]
Min Memory Usage per chunk: 0.4919147491455078
Max Memory Usage per chunk: 0.9625625610351562

Total Memory: 9.878687858581543


In [16]:
chunk.dtypes

company_name                     object
company_category_code          category
company_country_code           category
company_state_code             category
company_region                 category
company_city                   category
investor_name                    object
investor_country_code          category
investor_state_code            category
investor_region                category
investor_city                  category
funding_round_type             category
funded_at                datetime64[ns]
raised_amount_usd               float64
dtype: object

By dropping some of the columns and changing the types of others, **we lowered the total memory footprint from 57 MB to 9.9 MB**.

## Loading Chunks Into SQLite

Now we're in good shape to start exploring and analyzing the data. The next step is to load each chunk into a table in a SQLite database so we can query the full data set.

In [17]:
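conn = sqlite3.connect("my_datasets/crunchbase.db")
cur = conn.cursor()
# Start from a clean table; IF EXISTS avoids an error on the first run, when the table doesn't exist yet
cur.execute("DROP TABLE IF EXISTS investments")
conn.commit()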

chunk_iter = pd.read_csv("my_datasets/crunchbase-investments.csv", chunksize=5000, encoding='ISO-8859-1',
                         usecols=keep_cols, dtype=data_types, parse_dates=["funded_at"])

for chunk in chunk_iter:
    chunk.to_sql("investments", conn, if_exists='append', index=False)

In [18]:
q = """
SELECT *
FROM investments
LIMIT 5
"""
test = pd.read_sql(q,conn, parse_dates=["funded_at"])
test = test.astype(data_types)
print(test.head())
print(test.info())

  company_name company_category_code company_country_code company_state_code  \
0     AdverCar           advertising                  USA                 CA   
1   LaunchGram                  news                  USA                 CA   
2         uTaP             messaging                  USA                NaN   
3     ZoopShop              software                  USA                 OH   
4     eFuneral                   web                  USA                 OH   

          company_region   company_city      investor_name  \
0                 SF Bay  San Francisco  1-800-FLOWERS.COM   
1                 SF Bay  Mountain View        10Xelerator   
2  United States - Other            NaN        10Xelerator   
3               Columbus       columbus        10Xelerator   
4              Cleveland      Cleveland        10Xelerator   

  investor_country_code investor_state_code investor_region investor_city  \
0                   USA                  NY        New York      New 

In [19]:
conn.close()

We were able to store the CSV in a SQLite database and read it back later with read_sql().

However, the pandas data types aren't restored automatically when querying: dates must be parsed again with parse_dates, and category columns must be converted manually with df.astype().
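
The reason is that SQLite has only a handful of storage classes and no native date or categorical type. Here is a minimal sketch with the standard `sqlite3` module, using an in-memory database and a made-up row; the simplified table definition is an assumption and may differ from the schema that to_sql() generates.

```python
import sqlite3
from datetime import date

conn = sqlite3.connect(":memory:")  # throwaway database for the demo
conn.execute(
    "CREATE TABLE investments (company_name TEXT, funded_at TEXT, raised_amount_usd REAL)"
)
conn.execute(
    "INSERT INTO investments VALUES (?, ?, ?)",
    ("ExampleCo", date(2011, 1, 9).isoformat(), 10_000_000.0),  # hypothetical row
)

raw_date, storage_class = conn.execute(
    "SELECT funded_at, typeof(funded_at) FROM investments"
).fetchone()
conn.close()

parsed = date.fromisoformat(raw_date)  # the conversion has to be redone after reading
print(type(raw_date).__name__, storage_class, parsed)
```

The date comes back as plain text, so any richer type has to be reapplied on the Python side after each query.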

## Data Exploration and Analysis

Now that the data is in SQLite, we can use the pandas SQLite workflow we learned in the last mission to explore and analyze startup investments. Remember that each row isn't a unique company, but a unique investment from a single investor. This means that many startups will span multiple rows.
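
To make the difference between rows and companies concrete, here is a small sketch on an in-memory table with made-up rows (the company and investor names are hypothetical). It compares COUNT(*) with COUNT(DISTINCT company_name).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE investments (company_name TEXT, investor_name TEXT, raised_amount_usd REAL)"
)
rows = [  # hypothetical data: CompanyA appears once per investor
    ("CompanyA", "InvestorX", 5e6),
    ("CompanyA", "InvestorY", 5e6),
    ("CompanyB", "InvestorX", 1e6),
]
conn.executemany("INSERT INTO investments VALUES (?, ?, ?)", rows)

n_rows, n_companies = conn.execute(
    "SELECT COUNT(*), COUNT(DISTINCT company_name) FROM investments"
).fetchone()
conn.close()
print(n_rows, n_companies)
```

Any per-company figure therefore needs a GROUP BY company_name (or a DISTINCT) rather than a plain row count.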

- What proportion of the total amount of funds did the top 10% raise? What about the top 1%? Compare these values to the proportions that the bottom 10% and bottom 1% raised.

    There are **11573** different companies, so the top 10% corresponds to the 1157 companies that raised the most money.

In [20]:
conn = sqlite3.connect("my_datasets/crunchbase.db")

# Top10% of companies
q = """
SELECT company_name, SUM(raised_amount_usd)/1000000 AS total_received_millions
FROM investments
GROUP BY company_name
ORDER BY total_received_millions DESC
LIMIT 1157
"""
top10 = pd.read_sql(q,conn, parse_dates=["funded_at"])

# Total investment in the database
q = """
SELECT SUM(raised_amount_usd)/1000000
FROM investments
"""
total_inv_millions = pd.read_sql(q,conn, parse_dates=["funded_at"]).iloc[0,0]
conn.close()

In [21]:
#Proportion Top10%
prop_top10 = top10["total_received_millions"].sum()/total_inv_millions *100
print("The top10% of companies got the {:.2f}% of all the investments.".format(prop_top10))

The top10% of companies got the 67.13% of all the investments.


In [22]:
#Proportion Top1% -> 115 companies
prop_top1 = top10["total_received_millions"].iloc[:115].sum()/total_inv_millions *100
print("The top1% of companies got the {:.2f}% of all the investments.".format(prop_top1))

The top1% of companies got the 26.22% of all the investments.


    The shares raised by the bottom 10% and the bottom 1% are practically 0%.
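
The bottom shares can be computed with the same kind of query as the top shares, sorted in ascending order. The following is a sketch on a toy in-memory table with ten hypothetical companies, not on the real database. COALESCE is an added assumption so that companies whose amounts are all missing count as zero.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE investments (company_name TEXT, raised_amount_usd REAL)")
# Hypothetical amounts: company i raised i million dollars, for i = 1..10.
conn.executemany(
    "INSERT INTO investments VALUES (?, ?)",
    [(f"Company{i}", i * 1e6) for i in range(1, 11)],
)

n_companies = conn.execute(
    "SELECT COUNT(DISTINCT company_name) FROM investments"
).fetchone()[0]
n_bottom = n_companies // 10  # number of companies in the bottom 10%

bottom_total = conn.execute(
    """
    SELECT SUM(total) FROM (
        SELECT COALESCE(SUM(raised_amount_usd), 0) AS total
        FROM investments
        GROUP BY company_name
        ORDER BY total ASC
        LIMIT ?
    )
    """,
    (n_bottom,),
).fetchone()[0]
grand_total = conn.execute(
    "SELECT SUM(raised_amount_usd) FROM investments"
).fetchone()[0]
conn.close()

bottom_share = bottom_total / grand_total * 100
print(f"Bottom 10% share: {bottom_share:.2f}%")
```

Pointing the same queries at the investments table, with the LIMIT set to 1157 or 115, would give the bottom 10% and bottom 1% figures for the real data.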

- Which category of company attracted the most investments?

In [23]:
conn = sqlite3.connect("my_datasets/crunchbase.db")

# Most attractive company category
q = """
SELECT company_category_code AS category, SUM(raised_amount_usd)/1000000 AS total_received_millions
FROM investments
GROUP BY company_category_code
ORDER BY total_received_millions DESC
"""
best_categories = pd.read_sql(q,conn, parse_dates=["funded_at"])

conn.close()

In [24]:
best_categories.head(3)

| | category | total_received_millions |
|---|---|---|
| 0 | biotech | 110396.423062 |
| 1 | software | 73084.516724 |
| 2 | mobile | 64777.379752 |


    The top three categories by total investment are:
        1. Biotech, with 110396 million dollars
        2. Software, with 73084 million dollars
        3. Mobile, with 64777 million dollars
        
- Which investor contributed the most money (across all startups)?

In [25]:
conn = sqlite3.connect("my_datasets/crunchbase.db")

# Top 3 investors
q = """
SELECT investor_name AS investor, SUM(raised_amount_usd)/1000000 AS total_invested_millions
FROM investments
GROUP BY investor_name
ORDER BY total_invested_millions DESC
LIMIT 10
"""
best_investors = pd.read_sql(q,conn, parse_dates=["funded_at"])

conn.close()

In [26]:
best_investors.head(3)

| | investor | total_invested_millions |
|---|---|---|
| 0 | Kleiner Perkins Caufield & Byers | 11217.826376 |
| 1 | New Enterprise Associates | 9692.542344 |
| 2 | Accel Partners | 6472.126199 |


    The top three investors are:
        1. Kleiner Perkins Caufield & Byers, with 11217 million dollars
        2. New Enterprise Associates, with 9692 million dollars
        3. Accel Partners, with 6472 million dollars
        
- Which investors contributed the most money per startup?

    Let's use the average amount per investment as an approximation of the money contributed per startup.

In [27]:
conn = sqlite3.connect("my_datasets/crunchbase.db")

# Top 3 investors per startup
q = """
SELECT investor_name AS investor, AVG(raised_amount_usd)/1000000 AS average_invested_millions, COUNT(*) AS number_of_companies
FROM investments
GROUP BY investor_name
ORDER BY average_invested_millions DESC
LIMIT 10
"""
best_investors_per_startup = pd.read_sql(q,conn, parse_dates=["funded_at"])

conn.close()

In [28]:
best_investors_per_startup.head(3)

| | investor | average_invested_millions | number_of_companies |
|---|---|---|---|
| 0 | Marlin Equity Partners | 2600.0 | 1 |
| 1 | BrightHouse | 2350.0 | 2 |
| 2 | GI Partners | 1050.0 | 1 |


    The top three investors by average amount per investment are:
        1. Marlin Equity Partners, with an average of 2600 million dollars (1 investment)
        2. BrightHouse, with an average of 2350 million dollars (2 investments)
        3. GI Partners, with an average of 1050 million dollars (1 investment)
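
Note that COUNT(*) counts investment rows, so an investor that funds the same startup twice is counted twice. One possible refinement, shown here only as a sketch on a toy in-memory table with hypothetical names, divides each investor's total by the number of distinct companies it backed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE investments (company_name TEXT, investor_name TEXT, raised_amount_usd REAL)"
)
conn.executemany(
    "INSERT INTO investments VALUES (?, ?, ?)",
    [  # hypothetical rows: InvestorX backs CompanyA in two separate rounds
        ("CompanyA", "InvestorX", 2e6),
        ("CompanyA", "InvestorX", 4e6),
        ("CompanyB", "InvestorX", 6e6),
        ("CompanyC", "InvestorY", 9e6),
    ],
)

query = """
SELECT investor_name,
       COUNT(*) AS n_investments,
       COUNT(DISTINCT company_name) AS n_companies,
       SUM(raised_amount_usd) / COUNT(DISTINCT company_name) / 1000000 AS millions_per_company
FROM investments
GROUP BY investor_name
ORDER BY millions_per_company DESC
"""
result = conn.execute(query).fetchall()
conn.close()
for row in result:
    print(row)
```

Whether the per-row or the per-company average is the better measure depends on how "per startup" is read; the two only coincide when every investor appears at most once per company.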
        
- Which funding round was the most popular? Which was the least popular?

In [29]:
conn = sqlite3.connect("my_datasets/crunchbase.db")

# Funding rounds analysis
q = """
SELECT funding_round_type AS funding_round, COUNT(*) AS number_of_companies, 
    SUM(raised_amount_usd)/1000000 AS total_invested_millions, AVG(raised_amount_usd)/1000000 AS average_invested_millions
FROM investments
GROUP BY funding_round_type
ORDER BY number_of_companies DESC
"""
funding_round_info = pd.read_sql(q,conn, parse_dates=["funded_at"])

conn.close()

In [30]:
funding_round_info = funding_round_info[funding_round_info["funding_round"].notnull()]
funding_round_info

| | funding_round | number_of_companies | total_invested_millions | average_invested_millions |
|---|---|---|---|---|
| 0 | series-a | 13938 | 86542.150833 | 6.469474 |
| 1 | series-c+ | 10870 | 265753.464207 | 24.689099 |
| 2 | angel | 8989 | 4962.075061 | 0.690136 |
| 3 | venture | 8917 | 130556.496419 | 16.256568 |
| 4 | series-b | 8794 | 128326.776084 | 14.869847 |
| 5 | other | 964 | 18507.257968 | 19.815051 |
| 6 | private-equity | 357 | 16159.875901 | 51.794474 |
| 7 | post-ipo | 33 | 30917.6 | 1066.124138 |
| 8 | crowdfunding | 5 | 6.4915 | 1.622875 |


    The three most popular funding rounds are:
        1. 'series-a', with a total of 13938 investments
        2. 'series-c+', with a total of 10870 investments
        3. 'angel', with a total of 8989 investments

    The three least popular funding rounds are:
        1. 'crowdfunding', with a total of 5 investments
        2. 'post-ipo', with a total of 33 investments
        3. 'private-equity', with a total of 357 investments

    However, the number of investments is not directly related to the total (or average) amount of money raised per funding round type. For example, 'post-ipo' has only 33 investments but the highest average, about 1066 million dollars per investment.
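
To make that last point concrete, the sketch below ranks the funding round types once by number of investments and once by average amount, using the figures from the table above (the list and variable names are illustrative).

```python
# (round type, number of investments, average in millions), from the table above.
rounds = [
    ("series-a", 13938, 6.469474),
    ("series-c+", 10870, 24.689099),
    ("angel", 8989, 0.690136),
    ("venture", 8917, 16.256568),
    ("series-b", 8794, 14.869847),
    ("other", 964, 19.815051),
    ("private-equity", 357, 51.794474),
    ("post-ipo", 33, 1066.124138),
    ("crowdfunding", 5, 1.622875),
]

by_count = [name for name, n, _ in sorted(rounds, key=lambda r: r[1], reverse=True)]
by_average = [name for name, _, avg in sorted(rounds, key=lambda r: r[2], reverse=True)]

print("Top 3 by count:  ", by_count[:3])
print("Top 3 by average:", by_average[:3])
```

Only 'series-c+' appears in both top-three lists, while the two round types with the highest averages are among the three least frequent.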