# 💸 Analyzing Crunchbase Data on funding rounds

This is a Dataquest project for practicing dataset size optimization by choosing appropriate data types in pandas, and for analyzing the data with an approach that combines SQL and pandas.

The dataset comes from Crunchbase and contains crowdsourced information on the fundraising rounds of many startups. It covers data up to 2013.

During this project we will:
- Perform a short data discovery
- Optimize memory usage
- Load and analyze the data using SQL and pandas

## Data discovery

In this first step, we want to look at a sample of our data to become familiar with its contents and possible limitations.

The imposed constraint for this project is a memory limit of 10MB, so we need to make sure that the dataset, or each chunk of it, doesn't exceed that limit. Our ultimate goal is to reduce the dataset size below 10MB so that we can work with it as a whole. This is only possible because the dataset is fairly small; in production settings it would rarely be an option.

The aim of this part is to:
- Determine which columns contain redundant information and can be omitted
- Determine a chunk size that won't exceed 50% of our memory limit (5MB)
- Get the starting memory footprint once loaded as a dataframe
- Get the data types of the various columns
- Get the number of NaN and unique values in each column
- Iterate on the first point and see whether more columns could be excluded

Let's first examine a sample of the data to see whether we need all of the columns for the analysis, as well as to determine the chunk size for reading the dataset. Our memory limit is 10MB, so we are looking for chunks of around 5MB.

In [1]:
import pandas as pd

print(pd.read_csv("crunchbase-investments.csv", nrows = 5, encoding = "ISO-8859-1"))

     company_permalink company_name company_category_code  \
0    /company/advercar     AdverCar           advertising   
1  /company/launchgram   LaunchGram                  news   
2        /company/utap         uTaP             messaging   
3    /company/zoopshop     ZoopShop              software   
4    /company/efuneral     eFuneral                   web   

  company_country_code company_state_code         company_region  \
0                  USA                 CA                 SF Bay   
1                  USA                 CA                 SF Bay   
2                  USA                NaN  United States - Other   
3                  USA                 OH               Columbus   
4                  USA                 OH              Cleveland   

    company_city          investor_permalink      investor_name  \
0  San Francisco  /company/1-800-flowers-com  1-800-FLOWERS.COM   
1  Mountain View        /company/10xelerator        10Xelerator   
2            NaN       

It looks like the company_permalink, investor_permalink, funded_month, funded_quarter and funded_year columns contain redundant information, so let's drop those columns from our analysis.

Now, let's determine the chunksize that appears to be appropriate.

In [144]:
columns_to_use = ["company_name", "company_category_code", "company_country_code", "company_state_code", "company_region", "company_city", "investor_name", "investor_category_code", "investor_country_code", "investor_state_code", "investor_region", "investor_city", "funding_round_type", "funded_at", "raised_amount_usd" ]
print(pd.read_csv("crunchbase-investments.csv", nrows = 4000, encoding = "ISO-8859-1").memory_usage(deep = True).sum()/(1024*1024))
print(pd.read_csv("crunchbase-investments.csv", nrows = 6000, encoding = "ISO-8859-1", usecols = columns_to_use).memory_usage(deep = True).sum()/(1024*1024))

4.448305130004883
4.990866661071777


Just by removing the redundant columns we can already use larger chunks (6000 rows instead of 4000 within roughly 5MB), which will speed up all of the operations we perform. Some of the dropped columns, especially the date ones, might be useful if you perform a lot of operations on them, but otherwise the same information can be derived from the funded_at column alone. Ideally you would keep a date table within a database and join the two tables together.
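A rough way to pick a chunk size without trial and error is to divide the memory budget by the average per-row footprint of a sample. The sketch below does this on a toy dataframe. The helper `estimate_chunk_rows` and the toy values are hypothetical and not part of the original notebook, and the estimate assumes that the sample is representative of the rest of the file.

```python
import pandas as pd


def estimate_chunk_rows(sample: pd.DataFrame, budget_mb: float) -> int:
    """Estimate how many rows fit into budget_mb, based on a sample's footprint."""
    # deep=True also counts the Python strings behind object columns
    bytes_per_row = sample.memory_usage(deep=True).sum() / len(sample)
    return int(budget_mb * 1024 * 1024 // bytes_per_row)


# Toy sample standing in for pd.read_csv(..., nrows=...); values are invented
sample = pd.DataFrame({
    "name": ["alpha", "beta", "gamma", "delta"] * 250,
    "amount": [1.0, 2.0, None, 4.0] * 250,
})

rows = estimate_chunk_rows(sample, budget_mb=5)
print("Estimated rows per 5MB chunk:", rows)
```

Because real chunks vary in content, it is still worth verifying the estimate against the actual per-chunk sizes.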

Now that we have a chunk size that seems appropriate, let's check that no chunk surpasses our 5MB limit across the whole dataset. To do this, we will write a function that reports the memory usage per chunk and for the whole dataset, as we'll repeat this calculation multiple times during optimization.

In [3]:
# This function takes the name of the CSV you want to size as a string ("filename.csv"),
# the chunk size determined previously, the encoding of the file, a list of columns
# to keep (if None, all columns are read), a dictionary of data types per column and
# finally the columns that should be parsed as dates. Check the pandas.read_csv documentation
# for examples of inputs. The detail flag prints the size of each chunk in addition to the
# total, in MB. No return value.

def get_memory_usage(csv, size_of_chunk, file_encoding, columns = None, detail = False, data_types = None, to_date = None):
    import pandas as pd
    # None defaults avoid shared mutable default arguments; usecols = None reads every column
    chunk_iter = pd.read_csv(csv, chunksize = size_of_chunk, encoding = file_encoding, usecols = columns,
                             dtype = data_types, parse_dates = to_date if to_date else False)
    
    used_memory = []
    
    for chunk in chunk_iter:
        # deep = True also counts the Python string objects behind object columns
        used_memory.append(chunk.memory_usage(deep = True).sum()/(1024*1024))
    
    if detail:
        for i, chunk_memory in enumerate(used_memory, start = 1):
            print(i, " : ", chunk_memory)
    print("Total memory used:", sum(used_memory))

In [4]:
get_memory_usage("crunchbase-investments.csv", 6000, "ISO-8859-1", columns = columns_to_use, detail = True )

1  :  4.990866661071777
2  :  4.851469993591309
3  :  4.850440979003906
4  :  4.835270881652832
5  :  4.867792129516602
6  :  4.846621513366699
7  :  4.840594291687012
8  :  4.478250503540039
9  :  3.1795654296875
Total memory used: 41.740872383117676


It looks like 6000 is an appropriate chunk size, as all of the chunks are under 5MB. We have also learned that the dataset is split into 9 chunks and that the resulting dataframe takes up almost 42MB in total.

Now let's dig deeper into the columns. We will get the data types of each column as well as the size that it occupies so that we can determine which columns have wrong data types and which should be our focus if we want to reduce the memory footprint of our dataframe.

It is possible that data types will vary between chunks, which can point us to uncleaned data, so we will list all data types detected across chunks for each column.

In [5]:
chunk_iter = pd.read_csv("crunchbase-investments.csv", chunksize = 6000, encoding = "ISO-8859-1", usecols = columns_to_use)

column_sizes = {}
column_dtypes = {}

for chunk in chunk_iter:
    # iterating over a dataframe yields its column names
    for column in chunk:
        column_type = chunk[column].dtype
        column_memory = chunk[column].memory_usage(deep = True)/(1024*1024)
        if column in column_sizes:
            column_sizes[column] += column_memory
            if column_type not in column_dtypes[column]:
                column_dtypes[column].append(column_type)
        else:
            column_sizes[column] = column_memory
            column_dtypes[column] = [column_type]
            
for column in column_sizes:
    print(column, ": \n Size = ", column_sizes[column], "MB \n Data types = ", column_dtypes[column])

company_name : 
 Size =  3.426084518432617 MB 
 Data types =  [dtype('O')]
company_category_code : 
 Size =  3.2637481689453125 MB 
 Data types =  [dtype('O')]
company_country_code : 
 Size =  3.0263519287109375 MB 
 Data types =  [dtype('O')]
company_state_code : 
 Size =  2.963290214538574 MB 
 Data types =  [dtype('O')]
company_region : 
 Size =  3.254631996154785 MB 
 Data types =  [dtype('O')]
company_city : 
 Size =  3.344602584838867 MB 
 Data types =  [dtype('O')]
investor_name : 
 Size =  3.7353992462158203 MB 
 Data types =  [dtype('O')]
investor_category_code : 
 Size =  0.6176071166992188 MB 
 Data types =  [dtype('O'), dtype('float64')]
investor_country_code : 
 Size =  2.5944480895996094 MB 
 Data types =  [dtype('O'), dtype('float64')]
investor_state_code : 
 Size =  2.4316701889038086 MB 
 Data types =  [dtype('O'), dtype('float64')]
investor_region : 
 Size =  3.24007511138916 MB 
 Data types =  [dtype('O')]
investor_city : 
 Size =  2.821223258972168 MB 
 Data types =

Several columns show more than one data type. Since these are columns that, judging by their names, should hold strings (object), the float type detected in some chunks is probably due to NaN values: a chunk in which every value of a column is missing is read as float64. We will check that later. We can also see that the largest columns, and therefore the ones we should try to shrink, are the string columns. Our option here is to convert them to the category data type, but this only pays off for columns with a reasonable number of unique values.
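The effect of missing values on the inferred data type can be reproduced on a tiny in-memory CSV. This is only an illustration with invented data, not the Crunchbase file: the second chunk contains no state codes at all, so pandas has nothing but NaN to infer a type from.

```python
import io

import pandas as pd

# Toy CSV: the last two rows have an empty "state" field
csv_text = "name,state\nA,CA\nB,OH\nC,\nD,\n"

# Read two rows at a time and record the inferred dtype of "state" per chunk
dtypes_per_chunk = [
    chunk["state"].dtype
    for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2)
]

for number, dtype in enumerate(dtypes_per_chunk, start=1):
    print("chunk", number, ":", dtype)
```

The first chunk is read as text, while the all-empty second chunk comes back as float64, which matches the mixed data types listed above.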

So let's make a function that will help us:
a) confirm our suspicion about NaN values by showing the number and share of NaN values per column
b) determine which columns contain a manageable number of unique values and could be converted to category data type

In [145]:
# This function takes the name of the CSV you want to analyze, the chunk size, the file's encoding,
# the name of the column to analyze and a "long" flag that, if True, prints the full frequency table.
# The number and share of NaN values and the number of unique values are always printed.
# No return value.
def get_value_frequency_and_empty_per_column(csv, size_of_chunk, file_encoding, column, long = True):
    import pandas as pd
    frequencies = []
    column_list = [column]
    empty_values = 0
    rows = 0
    chunk_iter = pd.read_csv(csv, chunksize = size_of_chunk, encoding = file_encoding, usecols = column_list)
    for chunk in chunk_iter:
        frequencies.append(chunk[column].value_counts())
        empty_values += chunk[column].isna().sum()
        rows += chunk.shape[0]
    frequencies_sum = pd.concat(frequencies)
    frequencies_sum = frequencies_sum.groupby(frequencies_sum.index).sum()
    unique_values = frequencies_sum.shape[0]
    if long:
        print(frequencies_sum)
    print(column, "\n NA values:", empty_values, "\n % of empty:",empty_values/rows*100, "\n unique values:", unique_values)

columns = pd.read_csv("crunchbase-investments.csv", nrows = 1, encoding = "ISO-8859-1", usecols = columns_to_use).columns  

for column in columns:
    get_value_frequency_and_empty_per_column("crunchbase-investments.csv", 6000, "ISO-8859-1",column, long = False)

company_name 
 NA values: 1 
 % of empty: 0.0018914318138831094 
 unique values: 11573
company_category_code 
 NA values: 643 
 % of empty: 1.2161906563268394 
 unique values: 43
company_country_code 
 NA values: 1 
 % of empty: 0.0018914318138831094 
 unique values: 2
company_state_code 
 NA values: 492 
 % of empty: 0.9305844524304898 
 unique values: 50
company_region 
 NA values: 1 
 % of empty: 0.0018914318138831094 
 unique values: 546
company_city 
 NA values: 533 
 % of empty: 1.0081331567996974 
 unique values: 1229
investor_name 
 NA values: 2 
 % of empty: 0.003782863627766219 
 unique values: 10465
investor_category_code 
 NA values: 50427 
 % of empty: 95.37923207868356 
 unique values: 33
investor_country_code 
 NA values: 12001 
 % of empty: 22.699073198411195 
 unique values: 72
investor_state_code 
 NA values: 16809 
 % of empty: 31.793077359561188 
 unique values: 50
investor_region 
 NA values: 2 
 % of empty: 0.003782863627766219 
 unique values: 585
investor_city 


Based on the above we will exclude investor_category_code from our analysis since more than 95% of the rows don't contain this information so it wouldn't be useful for deducing anything about the whole dataset.

Looking at the NaN values, it appears that in most cases the float type comes from these missing values. For some columns they are explained by the fact that most of the data is from the US and records for other countries presumably carry no state code, so the state code is NaN by default.

However, other columns still have a large share of empty fields, which points to the incompleteness of the dataset.

## Data size optimization

Now that we have a better understanding of the dataset, let's look at how we can optimize each column's data type for more efficient storage.

For object (string) columns we will try the category data type where possible. Our only numeric column contains NaN values, so we will keep it as a float, because NumPy integer types cannot represent NaN.

Looking at the output from our function above the columns that are suited to become category datatype are:
- company_category_code 
- company_country_code
- company_state_code
- company_region
- company_city
- investor_country_code
- investor_state_code
- investor_region
- investor_city
- funding_round_type
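To see why columns with few unique values are good candidates, the following sketch compares the footprint of the same toy column stored as object and as category. The values are invented for illustration and the exact savings on the real columns will differ.

```python
import pandas as pd

# Toy column: many rows, only four distinct values
states = pd.Series(["CA", "OH", "NY", "WA"] * 2500, dtype=object)

as_object = states.memory_usage(deep=True)
as_category = states.astype("category").memory_usage(deep=True)
unique_share = states.nunique() / len(states)

print("object bytes:  ", as_object)
print("category bytes:", as_category)
print("share of unique values: {:.2%}".format(unique_share))
```

A category column stores each distinct value once plus a small integer code per row, so the fewer unique values relative to the row count, the larger the saving.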

We also have one date column, funded_at, so let's convert it from object to the datetime data type as well.
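As a small illustration of the gain from parsing dates, the sketch below compares date strings stored as object with the same values parsed to datetime. The sample dates and their `YYYY-MM-DD` format are assumptions made for this example only.

```python
import pandas as pd

# Toy date strings; the format is assumed for the example
dates_as_text = pd.Series(["2012-06-30", "2012-07-15"] * 500, dtype=object)
dates_parsed = pd.to_datetime(dates_as_text, format="%Y-%m-%d")

text_bytes = dates_as_text.memory_usage(deep=True)
parsed_bytes = dates_parsed.memory_usage(deep=True)

print("as text:  ", text_bytes, "bytes")
print("as dates: ", parsed_bytes, "bytes")
```

Parsed dates take a fixed 8 bytes per value and also make date arithmetic and filtering available.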

So let's look at the optimized size after applying these changes:

In [14]:
columns_to_use = ["company_name", "company_category_code", "company_country_code", "company_state_code", "company_region", "company_city", "investor_name", "investor_country_code", "investor_state_code", "investor_region", "investor_city", "funding_round_type", "funded_at", "raised_amount_usd" ]
columns_dts = {
    "company_name" : "object",
    "company_category_code" : "category",
    "company_country_code" : "category",
    "company_state_code" : "category",
    "company_region" : "category",
    "company_city" : "category",
    "investor_name" : "object",
    "investor_country_code" : "category",
    "investor_state_code" : "category",
    "investor_region" : "category",
    "investor_city" : "category",
    "funding_round_type" : "category",
    "raised_amount_usd" : "float64"
}

get_memory_usage("crunchbase-investments.csv", 6000, "ISO-8859-1", columns = columns_to_use, data_types = columns_dts, to_date = ["funded_at"], detail = True )



1  :  1.1286096572875977
2  :  1.1288862228393555
3  :  1.1041269302368164
4  :  1.1190757751464844
5  :  1.1292734146118164
6  :  1.121281623840332
7  :  1.1095781326293945
8  :  1.039377212524414
9  :  0.809748649597168
Total memory used: 9.689957618713379


We have managed to cut the size of the dataset to under 10MB, so we can increase our chunk size and speed up all of our calculations. We could even load the whole dataset at once and still be within our memory limit, but we will keep it to 2 chunks just to be safe.

In [20]:
get_memory_usage("crunchbase-investments.csv", 28000, "ISO-8859-1", columns = columns_to_use, data_types = columns_dts, to_date = ["funded_at"], detail = True )


1  :  4.909026145935059
2  :  4.270639419555664
Total memory used: 9.179665565490723


## Analysis with SQL

Now that our dataset is optimized, we can start querying it with SQL. We will use SQLite for this.

In [22]:
import sqlite3

conn = sqlite3.connect("crunchbase.db")
crunch_iter = pd.read_csv("crunchbase-investments.csv", chunksize=28000, encoding = "ISO-8859-1", usecols = columns_to_use, dtype = columns_dts, parse_dates = ["funded_at"])
for chunk in crunch_iter:    
    chunk.to_sql("investments", conn, if_exists='append', index=False)

Let's check that our data types are still correct after conversion to SQLite data types. SQLite doesn't support the category data type, so those columns should appear as "TEXT".

In [23]:
datatypes = pd.read_sql("PRAGMA table_info(investments)", conn)

print(datatypes)

    cid                   name       type  notnull dflt_value  pk
0     0           company_name       TEXT        0       None   0
1     1  company_category_code       TEXT        0       None   0
2     2   company_country_code       TEXT        0       None   0
3     3     company_state_code       TEXT        0       None   0
4     4         company_region       TEXT        0       None   0
5     5           company_city       TEXT        0       None   0
6     6          investor_name       TEXT        0       None   0
7     7  investor_country_code       TEXT        0       None   0
8     8    investor_state_code       TEXT        0       None   0
9     9        investor_region       TEXT        0       None   0
10   10          investor_city       TEXT        0       None   0
11   11     funding_round_type       TEXT        0       None   0
12   12              funded_at  TIMESTAMP        0       None   0
13   13      raised_amount_usd       REAL        0       None   0
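The same type mapping can be checked on a toy dataframe written to an in-memory SQLite database. The table name `toy_investments` and the values are made up for this sketch; only the column names come from the dataset.

```python
import sqlite3

import pandas as pd

# Toy frame with one category column and one float column (invented values)
toy = pd.DataFrame({
    "funding_round_type": pd.Series(["angel", "series-a", "angel"], dtype="category"),
    "raised_amount_usd": [1.0, 2.5, None],
})

toy_conn = sqlite3.connect(":memory:")
toy.to_sql("toy_investments", toy_conn, index=False)
info = pd.read_sql("PRAGMA table_info(toy_investments)", toy_conn)
toy_conn.close()

# Map each column name to the type SQLite declared for it
declared_types = dict(zip(info["name"], info["type"]))
print(declared_types)
```

The memory savings from the category data type therefore apply only on the pandas side; in the database the values are stored as plain text.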


We are now going to answer some questions about the dataset by using SQL to select columns and pandas for the calculations and grouping, as this is often more convenient for data of this size. For the first calculation we'll show how to push more of the work into SQL instead of pandas.

The questions are:

- What proportion of the total amount of funds did the top 10% raise? What about the top 1%? Compare these values to the proportions the bottom 10% and bottom 1% raised.
- Which category of company attracted the most investments?
- Which investor contributed the most money (across all startups)?
- Which investors contributed the most money per startup?
- Which funding round was the most popular? Which was the least popular?

In [113]:
data_query = "SELECT company_name, SUM(raised_amount_usd) as raised_sum FROM investments GROUP BY company_name "
data = pd.read_sql(data_query, conn)

# print(data.head(5))
# Some companies only have NaN amounts in the database; since they have no valid investment in the
# dataset we will exclude them. We will also sort the data in descending order.

data = data.dropna(subset = ["raised_sum"]).sort_values(by = "raised_sum", ascending = False)
# print("\n",data.head(5), "\n")

companies_funded_count = data.shape[0]
total_funds = data["raised_sum"].sum()

top10 = round(companies_funded_count*0.1) # rounding to a whole number because we cannot split a company in half
top1 = round(companies_funded_count*0.01)

funds_top10 = data.head(top10)["raised_sum"].sum()/total_funds
funds_top1 = data.head(top1)["raised_sum"].sum()/total_funds

funds_bottom10 = data.tail(top10)["raised_sum"].sum()/total_funds
funds_bottom1 = data.tail(top1)["raised_sum"].sum()/total_funds


print(
    "Top 10% of companies contributed to total by:", "{:.2%}".format(funds_top10),
    "\nTop 1% of companies contributed to total by:", "{:.2%}".format(funds_top1), 
    "\nBottom 10% of companies contributed to total by:", "{:.2%}".format(funds_bottom10), 
    "\nBottom 1% of companies contributed to total by:", "{:.4%}".format(funds_bottom1) )

Top 10% of companies contributed to total by: 64.56% 
Top 1% of companies contributed to total by: 25.14% 
Bottom 10% of companies contributed to total by: 0.03% 
Bottom 1% of companies contributed to total by: 0.0002%
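The share calculation can be wrapped in a small helper so that the top and bottom cases use the same logic. The sketch below is a hypothetical refactoring run on invented amounts; `share_of_total` is not part of the original notebook, and unlike the cell above it always selects at least one company.

```python
import pandas as pd


def share_of_total(amounts: pd.Series, fraction: float, top: bool = True) -> float:
    """Share of the total held by the top (or bottom) fraction of entries."""
    ranked = amounts.dropna().sort_values(ascending=False)
    # Assumption: keep at least one entry even for very small fractions
    count = max(1, round(len(ranked) * fraction))
    selected = ranked.head(count) if top else ranked.tail(count)
    return selected.sum() / ranked.sum()


# Ten toy amounts that add up to 100
toy_amounts = pd.Series([50, 20, 10, 5, 5, 4, 3, 1, 1, 1], dtype=float)

top_share = share_of_total(toy_amounts, 0.10)
bottom_share = share_of_total(toy_amounts, 0.10, top=False)

print("Top 10%: {:.2%}".format(top_share))
print("Bottom 10%: {:.2%}".format(bottom_share))
```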


In [114]:
data_query = "SELECT company_category_code, raised_amount_usd FROM investments"
data = pd.read_sql(data_query, conn)

# grouping by company category code and performing a sum, also converting to millions USD for better readability

category_sum = (data.groupby("company_category_code", dropna = False).sum()/1000000).sort_values(by = "raised_amount_usd", ascending = False)

# print(category_sum, "\n")

print(category_sum.head(1).first_valid_index().capitalize(), 
      "category, has raised the most money.\nMoney raised =",
    category_sum.head(1).iloc[0,0],
     "\n% of total =",
     "{:.2%}".format(category_sum.head(1).iloc[0,0]/(total_funds/1000000)))

Biotech category, has raised the most money.
Money raised = 110396.423062 
% of total = 16.19%
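The grouping could also be pushed into SQLite so that only the result row is loaded into pandas. The sketch below shows the idea on a toy in-memory table. The table name and values are invented, and unlike the pandas version above it excludes rows without a category.

```python
import sqlite3

import pandas as pd

# Toy data using the dataset's column names; values are invented
toy = pd.DataFrame({
    "company_category_code": ["biotech", "software", "biotech", "web", None],
    "raised_amount_usd": [5_000_000.0, 1_000_000.0, 2_000_000.0, None, 300_000.0],
})

toy_conn = sqlite3.connect(":memory:")
toy.to_sql("toy_investments", toy_conn, index=False)

# Sum per category in millions USD and keep only the largest one
query = """
    SELECT company_category_code,
           SUM(raised_amount_usd) / 1000000.0 AS raised_musd
    FROM toy_investments
    WHERE company_category_code IS NOT NULL
    GROUP BY company_category_code
    ORDER BY raised_musd DESC
    LIMIT 1
"""
top_category = pd.read_sql(query, toy_conn)
toy_conn.close()

print(top_category)
```

With this approach the database does the aggregation and pandas only receives a single row, which keeps the memory footprint minimal.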


In [115]:
data_query = "SELECT investor_name, raised_amount_usd FROM investments"
data = pd.read_sql(data_query, conn)

investor_sum = (data.groupby("investor_name", dropna = False).sum()/1000000).sort_values(by = "raised_amount_usd", ascending = False)

# print(investor_sum, "\n")

print(investor_sum.head(1).first_valid_index().capitalize(), 
      ", has invested the most money.\nMoney invested =",
    investor_sum.head(1).iloc[0,0],
     "\n% of total =",
     "{:.2%}".format(investor_sum.head(1).iloc[0,0]/(total_funds/1000000)))

Kleiner perkins caufield & byers , has invested the most money.
Money invested = 11217.826376 
% of total = 1.65%


In [121]:
data_query = "SELECT investor_name, company_name, raised_amount_usd FROM investments"
data = pd.read_sql(data_query, conn)

# summing up total investments per investor per startup, then taking the highest
# per-startup total for each investor

investor_sum = (data.groupby(["investor_name", "company_name"], dropna = False).sum()/1000000).sort_values(by = "investor_name", ascending = False)
investor_max = investor_sum.groupby(["investor_name"], dropna = False).max().sort_values(by = "raised_amount_usd", ascending = False)

print(investor_max.head(5), "\n")
print("The investors that contributed the most money per startup were/was:",
     investor_max[investor_max["raised_amount_usd"] == investor_max.iloc[0,0]].index.to_list())

               raised_amount_usd
investor_name                   
Intel                     5620.0
Comcast                   5620.0
Time Warner               5620.0
BrightHouse               4700.0
Google                    3200.0 

The investors that contributed the most money per startup were/was: ['Intel', 'Comcast', 'Time Warner']


In [118]:
data_query = "SELECT funded_at, raised_amount_usd FROM investments"
data = pd.read_sql(data_query, conn)

# The popularity of a funding round is defined by the number of investments made in it,
# not by the sum of funds raised. Here a round is identified by its funded_at date.

investments_count = (data.groupby("funded_at").count()).sort_values(by = "raised_amount_usd", ascending = False)
investments_count_not0 = investments_count[investments_count["raised_amount_usd"]>0]
investments_count_smallest = investments_count[investments_count["raised_amount_usd"] == investments_count_not0.iloc[-1,0]]

# print(investments_count)
# print(investments_count_smallest)

print("The most popular funding round (with most investments) happened on:",
     investments_count.first_valid_index(),
     "\nThere wasn't one specific round with the least amount of investments, but the smallest number of investments per round was ",
     investments_count_not0.iloc[-1,0],
     "and there were",
     investments_count_smallest.shape[0],
     "of them.")

The most popular funding round (with most investments) happened on: 2008-01-01 00:00:00 
There wasn't one specific round with the least amount of investments, but the smallest number of investments per round was  1 and there were 208 of them.
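The cell above identifies a round by its funded_at date. If "funding round" is instead read as the round type, the question could be answered by counting rows per funding_round_type. The sketch below shows that alternative on invented values; it is an assumption about the intended question, not a result from the dataset.

```python
import pandas as pd

# Toy round types; the values are illustrative only
toy = pd.DataFrame({
    "funding_round_type": [
        "series-a", "angel", "series-a", "series-b", "series-a", "angel",
    ],
})

# Number of investments per round type, most frequent first
round_counts = toy["funding_round_type"].value_counts()
most_popular = round_counts.idxmax()
least_popular = round_counts.idxmin()

print(round_counts)
print("Most popular:", most_popular)
print("Least popular:", least_popular)
```

On the real table the same counts could be obtained by selecting funding_round_type from investments and applying `value_counts()` to that column.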
