<h1>
<center>
Dataquest Guided Project 21:
Analyzing Startup Fundraising Deals from Crunchbase
</center>
</h1>

## Introduction

This is part of the Dataquest program.

- Part of the **Data Engineer** path
    - Step 2: **Handling Large Data Sets in Python**
        - Course 1: **Processing Large Data sets in Pandas**
            - Optimizing Dataframe Memory Footprint
            - Processing Dataframes in Chunks
            - Augmenting Pandas with SQLite
       
As this is a guided project, we follow and expand on the steps suggested by Dataquest. In this project, we will practice working with large datasets in pandas.

## Use case: Analyzing Startup Fundraising Deals from Crunchbase

Every year, thousands of startup companies raise financing from investors. Each time a startup raises money, we refer to the event as a fundraising round. Crunchbase is a website that crowdsources information on the fundraising rounds of many startups. The Crunchbase user community submits, edits, and maintains most of the information in Crunchbase.

In return, Crunchbase makes the data available through a Web application and a fee-based API. Before Crunchbase switched to the paid API model, multiple groups crawled the site and released the data online. Because the information on the startups and their fundraising rounds is always changing, the data set we'll be using isn't completely up to date.

The data set of investments we'll be exploring is current as of October 2013. You can download it from [GitHub](https://github.com/datahoarder/crunchbase-october-2013/blob/master/crunchbase-investments.csv).

Throughout this guided project, we'll practice working with different memory constraints. In this step, let's assume we only have 10 megabytes of available memory. While crunchbase-investments.csv consumes 10.3 megabytes of disk space, we know that pandas often requires 4 to 6 times as much space in memory as the file does on disk (especially when there are many string columns).
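As a rough illustration of that rule of thumb, the sketch below turns the figures quoted above into an estimated in-memory range. The variable names are illustrative, and the result is only a back-of-the-envelope bound that ignores per-chunk overhead.

```python
import math

disk_mb = 10.3    # size of crunchbase-investments.csv on disk, as quoted above
budget_mb = 10    # assumed memory budget for this step

# Apply the 4x to 6x rule of thumb to the on-disk size.
low, high = 4 * disk_mb, 6 * disk_mb
print(f"estimated footprint: {low:.1f} to {high:.1f} MB")

# Pessimistic number of pieces needed so that each piece fits the budget.
pieces = math.ceil(high / budget_mb)
print(f"split into at least {pieces} pieces")
```

Either end of the range is well above the 10 megabyte budget, which is why the file is read in chunks rather than in one call.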

## Read the data

In [1]:
import pandas as pd
pd.options.display.max_columns = 99
chunk_iter = pd.read_csv('crunchbase-investments.csv', chunksize=5000, encoding='ISO-8859-1')

Across all of the chunks, we want to check each column's missing value count, each column's memory footprint, and the total memory footprint of all the chunks combined. We also want to decide which columns we can drop because they are not useful for analysis.


### Missing Values

In [3]:
mv_list = []
for chunk in chunk_iter:
    mv_list.append(chunk.isnull().sum())
    
combined_mv = pd.concat(mv_list)
unique_combined_mv_vc = combined_mv.groupby(combined_mv.index).sum()
unique_combined_mv_vc.sort_values()

company_country_code          1
company_name                  1
company_permalink             1
company_region                1
investor_region               2
investor_permalink            2
investor_name                 2
funded_quarter                3
funded_at                     3
funded_month                  3
funded_year                   3
funding_round_type            3
company_state_code          492
company_city                533
company_category_code       643
raised_amount_usd          3599
investor_country_code     12001
investor_city             12480
investor_state_code       16809
investor_category_code    50427
dtype: int64
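Raw counts are easier to judge as a share of all rows. The following is a minimal sketch of the same chunked accumulation that also tracks the row count; it runs on a small made-up CSV (`csv_text` is hypothetical data, not the Crunchbase file) so that it is self-contained.

```python
import io
import pandas as pd

# Toy CSV standing in for the real file (hypothetical data).
csv_text = "a,b\n1,x\n2,\n,\n4,y\n5,\n6,z\n"

missing = None
n_rows = 0
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    counts = chunk.isnull().sum()
    # Add this chunk's per-column counts to the running total.
    missing = counts if missing is None else missing + counts
    n_rows += len(chunk)

# Share of missing values per column, in percent.
missing_pct = (missing / n_rows * 100).round(1)
print(missing_pct)
```

Adding the per-chunk Series directly works here because every chunk has the same columns, so no `groupby` on the index is needed.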

### Memory footprint for each column

In [4]:
chunk_iter = pd.read_csv('crunchbase-investments.csv', chunksize=5000, encoding='ISO-8859-1')
# Accumulate the per-column memory usage (in bytes) across all chunks.
series_memory_fp = None
for chunk in chunk_iter:
    chunk_memory = chunk.memory_usage(deep=True)
    if series_memory_fp is None:
        series_memory_fp = chunk_memory
    else:
        series_memory_fp += chunk_memory
series_memory_fp

Index                         920
company_permalink         4057788
company_name              3591326
company_category_code     3421104
company_country_code      3172176
company_state_code        3106051
company_region            3411585
company_city              3505926
investor_permalink        4980548
investor_name             3915666
investor_category_code     622424
investor_country_code     2647292
investor_state_code       2476607
investor_region           3396281
investor_city             2885083
funding_round_type        3410707
funded_at                 3542185
funded_month              3383584
funded_quarter            3383584
funded_year                422960
raised_amount_usd          422960
dtype: int64

### Total memory footprint of all the chunks combined

In [5]:
series_memory_fp.sum() / (1024 * 1024)

56.988484382629395
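The `deep=True` argument matters for these numbers. Without it, pandas counts only the 8-byte pointers of an object column rather than the Python strings they refer to. The sketch below shows the difference on a tiny made-up frame, with the text column forced to `object` dtype so the behaviour matches the string columns measured above.

```python
import pandas as pd

df = pd.DataFrame({
    "num": [1.0, 2.0, 3.0],
    "text": pd.Series(["alpha", "beta", "gamma"], dtype=object),
})

shallow = df.memory_usage()          # object columns: pointer size only
deep = df.memory_usage(deep=True)    # object columns: pointers plus string objects

print(shallow)
print(deep)
print(deep.sum() / (1024 * 1024), "MB")  # bytes converted to megabytes
```

The numeric column reports the same size either way, while the string column grows once the strings themselves are counted.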

### Columns to drop

In [6]:
# Drop columns representing URLs or containing way too many missing values (>90% missing)
drop_cols = ['investor_permalink', 'company_permalink', 'investor_category_code']
keep_cols = chunk.columns.drop(drop_cols)

In [7]:
keep_cols.tolist()

['company_name',
 'company_category_code',
 'company_country_code',
 'company_state_code',
 'company_region',
 'company_city',
 'investor_name',
 'investor_country_code',
 'investor_state_code',
 'investor_region',
 'investor_city',
 'funding_round_type',
 'funded_at',
 'funded_month',
 'funded_quarter',
 'funded_year',
 'raised_amount_usd']
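The 90% threshold can be checked against the counts reported earlier. This sketch uses three of those missing-value counts and derives the row count from the memory output, on the assumption that `raised_amount_usd` is stored as `float64` at 8 bytes per row (so 422960 bytes means 52870 rows).

```python
import pandas as pd

# Row count inferred from the float64 column's footprint (assumption: 8 bytes per row).
n_rows = 422960 // 8

# Missing-value counts copied from the output in the Missing Values section.
missing = pd.Series({
    "investor_category_code": 50427,
    "investor_state_code": 16809,
    "raised_amount_usd": 3599,
})

frac_missing = missing / n_rows
too_sparse = frac_missing[frac_missing > 0.9].index.tolist()
print(frac_missing.round(3))
print(too_sparse)
```

Under that assumption only `investor_category_code` crosses the threshold; the two permalink columns are dropped because they are URLs, not because of missing values.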

## Selecting Data Types

Let's first determine which columns shift types across chunks. 

In [8]:
# Key: Column name, Value: List of types
col_types = {}
chunk_iter = pd.read_csv('crunchbase-investments.csv', chunksize=5000, encoding='ISO-8859-1', usecols=keep_cols)

for chunk in chunk_iter:
    for col in chunk.columns:
        if col not in col_types:
            col_types[col] = [str(chunk.dtypes[col])]
        else:
            col_types[col].append(str(chunk.dtypes[col]))

In [9]:
uniq_col_types = {}
for col, types in col_types.items():
    uniq_col_types[col] = set(types)
uniq_col_types

{'company_category_code': {'object'},
 'company_city': {'object'},
 'company_country_code': {'object'},
 'company_name': {'object'},
 'company_region': {'object'},
 'company_state_code': {'object'},
 'funded_at': {'object'},
 'funded_month': {'object'},
 'funded_quarter': {'object'},
 'funded_year': {'float64', 'int64'},
 'funding_round_type': {'object'},
 'investor_city': {'float64', 'object'},
 'investor_country_code': {'float64', 'object'},
 'investor_name': {'object'},
 'investor_region': {'object'},
 'investor_state_code': {'float64', 'object'},
 'raised_amount_usd': {'float64'}}
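One likely reason for the mixed types is that pandas infers dtypes separately for each chunk. An integer column becomes `float64` as soon as a chunk contains a missing value, and a string column becomes `float64` when a whole chunk is empty. The sketch below reproduces both effects on a small made-up CSV (`csv_text` is hypothetical data).

```python
import io
import pandas as pd

# First chunk is complete; second chunk has a missing year and no city at all.
csv_text = "year,city\n2012,Paris\n2013,Oslo\n,\n2011,\n"

dtypes_seen = {}
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    for col, dtype in chunk.dtypes.items():
        # Collect every dtype observed for this column across chunks.
        dtypes_seen.setdefault(col, set()).add(str(dtype))

print(dtypes_seen)
```

Passing an explicit `dtype` mapping to `read_csv` avoids this per-chunk inference for the columns it names.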

In [10]:
convert_col_dtypes = {
    "company_category_code": "category",
    "company_country_code" : "category",
    "company_state_code" : "category",
    "funded_quarter" : "category",
    "funding_round_type": "category",
    "investor_country_code": "category",
    "investor_state_code": "category"   
}

In [11]:
chunk_iter = pd.read_csv('crunchbase-investments.csv', chunksize=5000, dtype=convert_col_dtypes, 
                         parse_dates=["funded_at", "funded_month", "funded_year"],
                         encoding='ISO-8859-1')
# Accumulate the per-column memory usage (in bytes) with the optimized dtypes.
series_memory_fp = None
for chunk in chunk_iter:
    chunk_memory = chunk.memory_usage(deep=True)
    if series_memory_fp is None:
        series_memory_fp = chunk_memory
    else:
        series_memory_fp += chunk_memory
series_memory_fp

Index                         920
company_permalink         4057788
company_name              3591326
company_category_code       96448
company_country_code        53674
company_state_code          96101
company_region            3411585
company_city              3505926
investor_permalink        4980548
investor_name             3915666
investor_category_code     622424
investor_country_code       82990
investor_state_code         83683
investor_region           3396281
investor_city             2885083
funding_round_type          62126
funded_at                  422960
funded_month               422960
funded_quarter             123014
funded_year                422960
raised_amount_usd          422960
dtype: int64

In [12]:
series_memory_fp.sum() / (1024 * 1024)

31.144545555114746

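We decreased the memory footprint from about 57 megabytes to about 31 megabytes. This measurement still includes the three columns selected for dropping (`company_permalink`, `investor_permalink` and `investor_category_code`), which together account for about 9.2 megabytes of the total, so also passing `usecols=keep_cols` should bring the footprint down to roughly 22 megabytes.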
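Most of the saving comes from the `category` dtype, which stores each distinct string once and replaces the column values with small integer codes. A minimal sketch on made-up country codes (not the project's data) shows the effect:

```python
import pandas as pd

# 10,000 values drawn from only three distinct strings (hypothetical data).
codes = pd.Series(["USA", "GBR", "CAN", "USA"] * 2500, dtype=object)
as_category = codes.astype("category")

before = codes.memory_usage(deep=True)
after = as_category.memory_usage(deep=True)

print(before, after)
print(as_category.cat.codes.dtype)  # integer type used for the codes
```

The saving shrinks as the number of distinct values approaches the number of rows, which is why columns such as `company_name` were left as strings.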

## Loading Chunks into SQLite

In [36]:
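import sqlite3
conn = sqlite3.connect('crunchbase.db')

# The previous iterator was fully consumed by the loop above, so create a fresh one.
chunk_iter = pd.read_csv('crunchbase-investments.csv', chunksize=5000, encoding='ISO-8859-1')
# if_exists='append' adds rows on every run; delete crunchbase.db before re-running this cell.
for chunk in chunk_iter:
    chunk.to_sql("investments", conn, if_exists='append', index=False)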
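Because `if_exists='append'` keeps adding rows, re-running the load duplicates the data. One way around this, sketched below on a made-up CSV and an in-memory database (`demo_conn`, `demo` and `csv_text` are illustrative names), is to replace the table on the first chunk and append afterwards.

```python
import io
import sqlite3
import pandas as pd

csv_text = "name,amount\nacme,100\nglobex,250\ninitech,75\n"  # hypothetical data
demo_conn = sqlite3.connect(":memory:")

for i, chunk in enumerate(pd.read_csv(io.StringIO(csv_text), chunksize=2)):
    # Recreate the table on the first chunk, then append the rest.
    mode = "replace" if i == 0 else "append"
    chunk.to_sql("demo", demo_conn, if_exists=mode, index=False)

row_count = demo_conn.execute("SELECT COUNT(*) FROM demo").fetchone()[0]
print(row_count)
```

Comparing the row count in the table with the number of rows read is a quick check that every chunk was written exactly once.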

In [42]:
conn = sqlite3.connect("crunchbase.db")
query = "PRAGMA table_info(investments);"
cur = conn.cursor()
cur.execute(query).fetchall()

[(0, 'company_permalink', 'TEXT', 0, None, 0),
 (1, 'company_name', 'TEXT', 0, None, 0),
 (2, 'company_category_code', 'TEXT', 0, None, 0),
 (3, 'company_country_code', 'TEXT', 0, None, 0),
 (4, 'company_state_code', 'TEXT', 0, None, 0),
 (5, 'company_region', 'TEXT', 0, None, 0),
 (6, 'company_city', 'TEXT', 0, None, 0),
 (7, 'investor_permalink', 'TEXT', 0, None, 0),
 (8, 'investor_name', 'TEXT', 0, None, 0),
 (9, 'investor_category_code', 'TEXT', 0, None, 0),
 (10, 'investor_country_code', 'TEXT', 0, None, 0),
 (11, 'investor_state_code', 'TEXT', 0, None, 0),
 (12, 'investor_region', 'TEXT', 0, None, 0),
 (13, 'investor_city', 'TEXT', 0, None, 0),
 (14, 'funding_round_type', 'TEXT', 0, None, 0),
 (15, 'funded_at', 'TEXT', 0, None, 0),
 (16, 'funded_month', 'TEXT', 0, None, 0),
 (17, 'funded_quarter', 'TEXT', 0, None, 0),
 (18, 'funded_year', 'INTEGER', 0, None, 0),
 (19, 'raised_amount_usd', 'REAL', 0, None, 0)]

## Data Exploration and Analysis

Which category of company attracted the most investment (by total amount raised)?

In [46]:
query = "SELECT company_category_code, SUM(raised_amount_usd) FROM investments GROUP BY company_category_code ORDER BY SUM(raised_amount_usd) DESC LIMIT 5"
cur = conn.cursor()
cur.execute(query).fetchall()

[('biotech', 110396423062.0),
 ('software', 73084516724.0),
 ('mobile', 64777379752.0),
 ('cleantech', 52705225028.0),
 ('enterprise', 45860927273.0)]

Which investor contributed the most money (across all startups)?

In [47]:
query = "SELECT investor_name, SUM(raised_amount_usd) FROM investments GROUP BY investor_name ORDER BY SUM(raised_amount_usd) DESC LIMIT 5"
cur = conn.cursor()
cur.execute(query).fetchall()

[('Kleiner Perkins Caufield & Byers', 11217826376.0),
 ('New Enterprise Associates', 9692542344.0),
 ('Accel Partners', 6472126199.0),
 ('Goldman Sachs', 6375459000.0),
 ('Sequoia Capital', 6039402410.0)]

Which funding round was the most popular (measured by total amount raised)?

In [50]:
query = "SELECT funding_round_type, SUM(raised_amount_usd) FROM investments GROUP BY funding_round_type ORDER BY SUM(raised_amount_usd) DESC LIMIT 5"
cur = conn.cursor()
cur.execute(query).fetchall()

[('series-c+', 265753464207.0),
 ('venture', 130556496419.0),
 ('series-b', 128326776084.0),
 ('series-a', 86542150833.0),
 ('post-ipo', 30917600000.0)]

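Which funding round was the least popular (measured by total amount raised)? Rows with a missing `funding_round_type` form their own `(None, None)` group, which sorts first in ascending order.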

In [51]:
query = "SELECT funding_round_type, SUM(raised_amount_usd) FROM investments GROUP BY funding_round_type ORDER BY SUM(raised_amount_usd) ASC LIMIT 5"
cur = conn.cursor()
cur.execute(query).fetchall()

[(None, None),
 ('crowdfunding', 6491500.0),
 ('angel', 4962075061.0),
 ('private-equity', 16159875901.0),
 ('other', 18507257968.0)]
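The `(None, None)` row can be kept out of the ranking by filtering missing values in the query, and "popularity" can also be read as the number of rows instead of the amount raised. The sketch below shows both on a tiny in-memory table with made-up rows; only the table and column names are taken from the project.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.execute(
    "CREATE TABLE investments (funding_round_type TEXT, raised_amount_usd REAL)"
)
rows = [
    ("angel", 100.0), ("angel", 50.0), ("series-a", 500.0),
    (None, None), ("series-a", None), ("crowdfunding", 10.0),
]  # hypothetical data
demo.executemany("INSERT INTO investments VALUES (?, ?)", rows)

query = """
SELECT funding_round_type,
       COUNT(*) AS n_rows,
       SUM(raised_amount_usd) AS total_usd
FROM investments
WHERE funding_round_type IS NOT NULL
  AND raised_amount_usd IS NOT NULL
GROUP BY funding_round_type
ORDER BY total_usd ASC
"""
result = demo.execute(query).fetchall()
print(result)
```

Ordering by `n_rows` instead of `total_usd` ranks the round types by how often they occur, which can give a different answer from ranking by money raised.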