# Analyzing Startup Fundraising Deals From Crunchbase

In this guided project, we'll practice using some of the techniques we learned to analyze startup investments from Crunchbase.com.

Throughout this project, we'll practice working with different memory constraints. Let's assume we only have 10 megabytes of available memory. While crunchbase-investments.csv consumes 10.3 megabytes of disk space, we know that pandas often requires 4 to 6 times as much space in memory as the file takes up on disk (especially when there are many string columns).

In [14]:
import pandas as pd
pd.options.display.max_columns = 99
# read the data set into dataframes using 5,000 row chunks 
# to ensure that each chunk consumes much less than 10 megabytes 
# of memory
crunch_iter = pd.read_csv('crunchbase-investments.csv', 
                          chunksize=5000, encoding='ISO-8859-1')
                          
# Across all of the chunks, become familiar with:
# Each column's missing value counts
mv_list = []
for chunk in crunch_iter:
    mv_list.append(chunk.isnull().sum())
    
mv_series = pd.concat(mv_list)
mv_series_grouped = mv_series.groupby(mv_series.index).sum()
mv_series_grouped.sort_values()

company_country_code          1
company_name                  1
company_permalink             1
company_region                1
investor_region               2
investor_permalink            2
investor_name                 2
funded_quarter                3
funded_at                     3
funded_month                  3
funded_year                   3
funding_round_type            3
company_state_code          492
company_city                533
company_category_code       643
raised_amount_usd          3599
investor_country_code     12001
investor_city             12480
investor_state_code       16809
investor_category_code    50427
dtype: int64
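
The cell above keeps one Series per chunk and combines them at the end. A minimal alternative sketch, assuming a small in-memory CSV (`toy_csv`, hypothetical data) in place of the real file, keeps only a single running total, which may help under tighter memory limits. The helper name `count_missing` is made up for illustration.

```python
import io
import pandas as pd

# Hypothetical toy CSV standing in for crunchbase-investments.csv.
toy_csv = io.StringIO(
    "company_name,raised_amount_usd\n"
    "A,100\n"
    "B,\n"
    ",300\n"
    "D,\n"
)

def count_missing(chunks):
    """Sum per-column missing-value counts across an iterator of chunks."""
    total = None
    for chunk in chunks:
        counts = chunk.isnull().sum()
        # fill_value=0 covers a column that is absent from one side.
        total = counts if total is None else total.add(counts, fill_value=0)
    return total.astype(int)

missing_counts = count_missing(pd.read_csv(toy_csv, chunksize=2))
print(missing_counts.sort_values())
```

Only one Series is alive at any time, so the bookkeeping cost does not grow with the number of chunks.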

Let's now explore the memory footprint.

In [24]:
crunch_iter = pd.read_csv('crunchbase-investments.csv', 
                          chunksize=5000, encoding='ISO-8859-1')
                          
# Across all of the chunks, become familiar with:
# Each column's memory footprint
mem_list = []
for chunk in crunch_iter:
    mem_list.append(chunk.memory_usage(deep=True))
    
mem_series = pd.concat(mem_list)
mem_series_grouped = mem_series.groupby(mem_series.index).sum()
# Drop the 'Index' entry (the row index's own footprint)
# because it isn't useful for the analysis
mem_series_grouped = mem_series_grouped.drop('Index')
mem_total = mem_series_grouped.sum()/(2**20)
print(mem_series_grouped, mem_total, sep='\n\n\n')

company_category_code     3421104
company_city              3505926
company_country_code      3172176
company_name              3591326
company_permalink         4057788
company_region            3411585
company_state_code        3106051
funded_at                 3542185
funded_month              3383584
funded_quarter            3383584
funded_year                422960
funding_round_type        3410707
investor_category_code     622424
investor_city             2885083
investor_country_code     2647292
investor_name             3915666
investor_permalink        4980548
investor_region           3396281
investor_state_code       2476607
raised_amount_usd          422960
dtype: int64


56.9876070022583
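
The total of roughly 57 megabytes is about 5.5 times the 10.3 megabyte file size, in line with the 4 to 6 times rule of thumb. The total alone doesn't show whether each individual chunk fits the 10 megabyte budget, so here is a minimal sketch of that check. It runs on hypothetical toy data (`toy_csv`), and `largest_chunk_mb` is an illustrative helper, not part of pandas.

```python
import io
import pandas as pd

MEMORY_BUDGET_MB = 10  # the constraint assumed in this project

def largest_chunk_mb(chunks):
    """Return the deep memory footprint, in megabytes, of the largest chunk."""
    return max(
        chunk.memory_usage(deep=True).sum() / 2**20 for chunk in chunks
    )

# Hypothetical toy data; pass the real file path to pd.read_csv instead.
toy_csv = io.StringIO("name,amount\n" + "startup,1000\n" * 1000)
peak_mb = largest_chunk_mb(pd.read_csv(toy_csv, chunksize=250))
print(f"largest chunk: {peak_mb:.3f} MB (budget: {MEMORY_BUDGET_MB} MB)")
```

If the largest chunk comes close to the budget, lowering `chunksize` is the simplest adjustment.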


In [25]:
# Drop columns representing URLs or containing 
# far too many missing values (>90% missing)
drop_cols = ['investor_permalink', 'company_permalink', 'investor_category_code']
keep_cols = chunk.columns.drop(drop_cols)
keep_cols.to_list()

['company_name',
 'company_category_code',
 'company_country_code',
 'company_state_code',
 'company_region',
 'company_city',
 'investor_name',
 'investor_country_code',
 'investor_state_code',
 'investor_region',
 'investor_city',
 'funding_round_type',
 'funded_at',
 'funded_month',
 'funded_quarter',
 'funded_year',
 'raised_amount_usd']
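
The 90% threshold can be checked against the missing-value counts shown earlier. The sketch below copies a few of those counts by hand and derives the row count from the memory output, under the assumption that `funded_year` takes 8 bytes per row.

```python
import pandas as pd

# Missing-value counts copied from the output earlier in this notebook.
missing_counts = pd.Series({
    "raised_amount_usd": 3599,
    "investor_country_code": 12001,
    "investor_city": 12480,
    "investor_state_code": 16809,
    "investor_category_code": 50427,
})

# Assumption: funded_year uses 8 bytes per row, so its 422960-byte
# footprint implies 422960 / 8 = 52870 rows in total.
total_rows = 422960 // 8

missing_fraction = missing_counts / total_rows
mostly_missing = missing_fraction[missing_fraction > 0.9].index.tolist()
print(missing_fraction.round(3))
print(mostly_missing)
```

Under that assumption, only `investor_category_code` crosses the threshold; the two permalink columns are dropped because they are URLs, not because of missing data.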

# Selecting Data Types

Let's get familiar with the column types before adding the data into SQLite.

In [47]:
crunch_iter = pd.read_csv('crunchbase-investments.csv', 
                          chunksize=5000, encoding='ISO-8859-1', 
                          usecols=keep_cols)
crunch_dtypes = {}
for chunk in crunch_iter:
    for col in chunk.columns:
        if col not in crunch_dtypes:
            crunch_dtypes[col]= [str(chunk.dtypes[col])]
        else:
            crunch_dtypes[col].append(str(chunk.dtypes[col]))
        
print(crunch_dtypes['company_name'])

['object', 'object', 'object', 'object', 'object', 'object', 'object', 'object', 'object', 'object', 'object']


Let's now get the unique column datatypes.

In [45]:
unique_dtypes = {}
for key in crunch_dtypes:
    unique_dtypes[key] = set(crunch_dtypes[key])
    
unique_dtypes  

{'company_name': {'object'},
 'company_category_code': {'object'},
 'company_country_code': {'object'},
 'company_state_code': {'object'},
 'company_region': {'object'},
 'company_city': {'object'},
 'investor_name': {'object'},
 'investor_country_code': {'float64', 'object'},
 'investor_state_code': {'float64', 'object'},
 'investor_region': {'object'},
 'investor_city': {'float64', 'object'},
 'funding_round_type': {'object'},
 'funded_at': {'object'},
 'funded_month': {'object'},
 'funded_quarter': {'object'},
 'funded_year': {'float64', 'int64'},
 'raised_amount_usd': {'float64'}}
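
Several investor columns show both a string dtype and `float64`. The most likely cause is that pandas infers types per chunk, and a chunk in which every value of a column is missing gets read as `float64` (all NaN). The sketch below reproduces this on a hypothetical toy CSV and shows one possible way to get a consistent type, by passing an explicit `dtype` to `read_csv`. The helper `dtypes_per_chunk` is made up for illustration.

```python
import io
import pandas as pd

# Hypothetical toy CSV: investor_city is empty for the last two rows.
toy_text = (
    "investor_name,investor_city\n"
    "Alice,Boston\n"
    "Bob,Austin\n"
    "Carol,\n"
    "Dan,\n"
)

def dtypes_per_chunk(text, column, **read_kwargs):
    """Collect the set of dtypes a column takes across 2-row chunks."""
    chunks = pd.read_csv(io.StringIO(text), chunksize=2, **read_kwargs)
    return {str(chunk[column].dtype) for chunk in chunks}

# Inferred per chunk: the all-missing chunk becomes float64.
inferred = dtypes_per_chunk(toy_text, "investor_city")
# Explicit string dtype: every chunk gets the same type.
explicit = dtypes_per_chunk(toy_text, "investor_city",
                            dtype={"investor_city": str})
print(inferred)
print(explicit)
```

The exact name of the string dtype depends on the pandas version, so the sketch only relies on the number of distinct dtypes.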

In [46]:
chunk.head()

Unnamed: 0,company_name,company_category_code,company_country_code,company_state_code,company_region,company_city,investor_name,investor_country_code,investor_state_code,investor_region,investor_city,funding_round_type,funded_at,funded_month,funded_quarter,funded_year,raised_amount_usd
50000,NuORDER,fashion,USA,CA,Los Angeles,West Hollywood,Mortimer Singer,,,unknown,,series-a,2012-10-01,2012-10,2012-Q4,2012,3060000.0
50001,ChaCha,advertising,USA,IN,Indianapolis,Carmel,Morton Meyerson,,,unknown,,series-b,2007-10-01,2007-10,2007-Q4,2007,12000000.0
50002,Binfire,software,USA,FL,Bocat Raton,Bocat Raton,Moshe Ariel,,,unknown,,angel,2008-04-18,2008-04,2008-Q2,2008,500000.0
50003,Binfire,software,USA,FL,Bocat Raton,Bocat Raton,Moshe Ariel,,,unknown,,angel,2010-01-01,2010-01,2010-Q1,2010,750000.0
50004,Unified Color,software,USA,CA,SF Bay,South San Frnacisco,Mr. Andrew Oung,,,unknown,,angel,2010-01-01,2010-01,2010-Q1,2010,


# Loading Chunks Into SQLite

Now we're in good shape to start exploring and analyzing the data. The next step is to load each chunk into a table in a SQLite database so we can query the full data set.

In [49]:
import sqlite3
crunch_iter = pd.read_csv('crunchbase-investments.csv', 
                          chunksize=5000, encoding='ISO-8859-1', 
                          usecols=keep_cols)
conn = sqlite3.connect('crunchbase.db')

for chunk in crunch_iter:
    chunk.to_sql('investments', conn, if_exists='append', index=False)
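
Because `if_exists='append'` adds rows on every run, re-running the cell above duplicates the data unless the table or database file is removed first. A quick sanity check after loading is to count the rows and inspect the column types SQLite ended up with. The sketch below shows that check on hypothetical toy chunks and an in-memory database, not on `crunchbase.db` itself.

```python
import sqlite3
import pandas as pd

# Hypothetical toy chunks standing in for the 5,000-row chunks.
chunks = [
    pd.DataFrame({"company_name": ["A", "B"],
                  "raised_amount_usd": [1.0, 2.0]}),
    pd.DataFrame({"company_name": ["C"],
                  "raised_amount_usd": [float("nan")]}),
]

conn = sqlite3.connect(":memory:")  # in-memory database for the sketch
for chunk in chunks:
    chunk.to_sql("investments", conn, if_exists="append", index=False)

# Row count should equal the total number of rows across all chunks.
row_count = pd.read_sql("SELECT COUNT(*) AS n FROM investments;", conn)["n"][0]
# PRAGMA table_info lists each column's name and declared type.
schema = pd.read_sql("PRAGMA table_info(investments);", conn)
print(row_count)
print(schema[["name", "type"]])
conn.close()
```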


# Next Steps

Now that the data is in SQLite, we can use the pandas SQLite workflow we learned in the last mission to explore and analyze startup investments. Remember that each row isn't a unique company, but a unique investment from a single investor. This means that many startups will span multiple rows.

Use the pandas SQLite workflow to answer the following questions:

- What proportion of the total amount of funds did the top 10% raise? What about the top 1%? Compare these values to the proportions the bottom 10% and bottom 1% raised.
- Which category of company attracted the most investments?
- Which investor contributed the most money (across all startups)?
- Which investors contributed the most money per startup?
- Which funding round was the most popular? Which was the least popular?
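
As a starting point for questions like these, here is a minimal sketch of the pandas SQLite workflow applied to the funding round question. It assumes "popular" means the number of investment rows, which is only one possible reading, and it runs on hypothetical toy rows in an in-memory database instead of the real `investments` table.

```python
import sqlite3
import pandas as pd

# Hypothetical toy rows; the real data lives in crunchbase.db.
toy = pd.DataFrame({
    "funding_round_type": ["series-a", "angel", "series-a",
                           "series-b", "series-a", "angel"],
    "raised_amount_usd": [3e6, 5e5, 1e6, 1.2e7, 2e6, 7.5e5],
})
conn = sqlite3.connect(":memory:")
toy.to_sql("investments", conn, index=False)

# Count investment rows per round type, most frequent first.
query = """
SELECT funding_round_type, COUNT(*) AS n_investments
FROM investments
GROUP BY funding_round_type
ORDER BY n_investments DESC;
"""
round_counts = pd.read_sql(query, conn)
print(round_counts)
conn.close()
```

Letting SQLite do the grouping means only the small aggregated result is loaded into pandas, which suits the memory constraint.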

Here are some ideas for further exploration:

- Repeat the tasks in this guided project using stricter memory constraints (under 1 megabyte).
- Clean and analyze the other Crunchbase data sets from the [same GitHub repo](https://github.com/datahoarder/crunchbase-october-2013).
    - Understand which columns the data sets share, and how the data sets are linked.
    - Create a relational database design that links the data sets together and reduces the overall disk space the database file consumes.
    - Use pandas to populate each table in the database, create the appropriate indexes, and so on.
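
To make the relational design idea concrete, here is a rough sketch on hypothetical toy rows. It moves the repeated company columns into their own table, links investments to it through a made-up `company_id` key, and indexes that key. The table, column, and index names are illustrative assumptions, not a prescribed schema.

```python
import sqlite3
import pandas as pd

# Hypothetical toy rows with repeated company details.
toy = pd.DataFrame({
    "company_name": ["Binfire", "Binfire", "ChaCha"],
    "company_city": ["Boca Raton", "Boca Raton", "Carmel"],
    "raised_amount_usd": [500000.0, 750000.0, 12000000.0],
})

# One row per company; company_id is the (hypothetical) key.
companies = (toy[["company_name", "company_city"]]
             .drop_duplicates()
             .reset_index(drop=True))
companies.insert(0, "company_id", companies.index + 1)

# Investments keep only the key instead of the repeated company columns.
investments = toy.merge(companies, on=["company_name", "company_city"])[
    ["company_id", "raised_amount_usd"]
]

conn = sqlite3.connect(":memory:")
companies.to_sql("companies", conn, index=False)
investments.to_sql("investments", conn, index=False)
# Index the join column so lookups by company stay fast.
conn.execute("CREATE INDEX idx_investments_company ON investments (company_id);")

totals = pd.read_sql("""
SELECT c.company_name, SUM(i.raised_amount_usd) AS total_raised
FROM investments AS i
JOIN companies AS c ON c.company_id = i.company_id
GROUP BY c.company_name
ORDER BY c.company_name;
""", conn)
print(totals)
conn.close()
```

Storing each company's details once is what reduces disk space; the join recovers the original view when needed.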
