# ⚡⚡__<font color=red>Spark Funds</font>__⚡⚡

## Objectives
 

## Project Brief
You work for Spark Funds, an asset management company. Spark Funds wants to invest in a few companies. The CEO of Spark Funds wants to understand global investment trends so that she can make investment decisions effectively.

## Business and Data Understanding
Spark Funds has two minor constraints for investments:

It wants to invest between 5 and 15 million USD per round of investment.

It wants to invest only in English-speaking countries because of the ease of communication with the companies it would invest in.

## Strategy
Spark Funds wants to invest where most other investors are investing.

## Business objective
The business objectives and goals of data analysis are pretty straightforward.

__Business objective__ The objective is to identify the best sectors, countries, and a suitable investment type for making investments. The overall strategy is to invest where others are investing, implying that the 'best' sectors and countries are the ones 'where most investors are investing'.

__Goals of data analysis__ Your goals are divided into three sub-goals:

Investment type analysis: Comparing the typical investment amounts in the venture, seed, angel, private equity etc. so that Spark Funds can choose the type that is best suited for their strategy.

Country analysis: Identifying the countries which have been the most heavily invested in the past. These will be Spark Funds’ favourites as well.

Sector analysis: Understanding the distribution of investments across the eight main sectors. (Note that we are interested in the eight 'main sectors' provided in the mapping file. The two files — companies and rounds2 — have numerous sub-sector names; hence, you will need to map each sub-sector to its main sector.)


# Data Loading and Cleaning

In [None]:
# Suppress warnings
import warnings
warnings.filterwarnings('ignore')
# Importing libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

###  Data Loading

In [None]:
# Setting working directory to required location
import os
print(os.listdir("../input"))

In [None]:
# reading data files
# using encoding = "ISO-8859-1" (also known as Latin-1) to avoid pandas encoding error
rounds = pd.read_csv("../input/rounds2.csv", encoding="ISO-8859-1")
rounds.head()

In [None]:
companies = pd.read_csv("../input/companies.txt", sep="\t", encoding="ISO-8859-1")
companies.head()

### Data inspection

In [None]:
companies.shape

In [None]:
companies.info()

In [None]:
companies.describe()

In [None]:
# inspect the structure 
rounds.shape

In [None]:
rounds.info()

In [None]:
rounds.describe()

### Data Cleaning

Ideally, the permalink column in the companies dataframe should be the unique_key of the table, having 66368 unique company names (links, or permalinks).<br>
Also, these 66368 companies should be present in the rounds file.<br>
Let's first confirm that these 66368 permalinks (which are the URL paths of companies' websites) are not repeating in the column, i.e. they are unique.<br>
Also, let's convert all the entries to lowercase (or uppercase) for uniformity.

In [None]:
# converting all permalinks to lowercase
companies['permalink'] = companies['permalink'].str.lower()
len(companies.permalink.unique())

Thus, there are 66368 unique companies in the table and permalink is the unique primary key. Each row represents a unique company.
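A more direct way to confirm that a column can act as a primary key is `Series.is_unique`. The sketch below uses a small made-up frame (the permalinks are illustrative, not rows from the dataset) to show why normalising the case before the check matters.

```python
import pandas as pd

# Toy data: hypothetical permalinks, not rows from companies.txt
toy = pd.DataFrame({
    "permalink": ["/Organization/Acme", "/organization/acme", "/organization/zeta"]
})

# Before normalising case, the two spellings of "acme" count as distinct values
unique_before = toy["permalink"].nunique()

# After lowercasing, the duplicate is exposed
normalised = toy["permalink"].str.lower()
unique_after = normalised.nunique()
is_key = normalised.is_unique

print(unique_before, unique_after, is_key)
```

If `is_unique` is `False` after normalisation, the column cannot be used as a key until the duplicated rows are resolved.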

Let's now check whether all of these 66368 companies are present in the rounds file, and if some extra ones are present

In [None]:
# converting column to lowercase
rounds['company_permalink'] = rounds['company_permalink'].str.lower()
len(rounds.company_permalink.unique())

There seem to be 2 extra permalinks in the rounds file which are not present in the companies file. Let's hope that this is a data quality issue, since if this were genuine, we have two companies whose investment round details are available but their metadata (company name, sector etc.) is not available in the companies table.

Let's have a look at the permalinks that do not match across the two files, starting with the companies which are in 'companies' but not in 'rounds'.
    

In [None]:
# companies present in companies df but not in rounds df
companies.loc[~companies['permalink'].isin(rounds['company_permalink']), :]

In [None]:
# Thus, the companies df also contains special characters. Let's treat those as well.

In [None]:
# strip non-ASCII characters from the permalink and name columns of companies and rounds df
companies['permalink'] = companies.permalink.str.encode('utf-8').str.decode('ascii', 'ignore')
companies['name'] = companies.name.str.encode('utf-8').str.decode('ascii', 'ignore')
rounds['company_permalink'] = rounds.company_permalink.str.encode('utf-8').str.decode('ascii', 'ignore')

Let's now look at the companies present in the companies df but not in rounds df - ideally there should be none. 

In [None]:
# companies present in companies df but not in rounds df
companies.loc[~companies['permalink'].isin(rounds['company_permalink']), :]

In [None]:
# Look at unique values again
len(rounds.company_permalink.unique())

Now it makes sense - there are 66368 unique companies in both the rounds and companies dataframes.
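Why the counts line up after cleaning can be illustrated on toy strings. The two permalinks below are hypothetical; they stand in for one company whose link appears once with a stray non-ASCII character and once without.

```python
import pandas as pd

# Hypothetical permalinks: the same company, once with a non-ASCII character
raw = pd.Series(["/organization/caf\u00e9-labs", "/organization/caf-labs"])

# Encode to UTF-8 bytes, then decode as ASCII, dropping every byte that is not ASCII
cleaned = raw.str.encode("utf-8").str.decode("ascii", "ignore")

unique_raw = raw.nunique()
unique_cleaned = cleaned.nunique()
print(unique_raw, unique_cleaned)
print(cleaned.tolist())
```

Note that this approach deletes characters rather than transliterating them, so two genuinely different names could in principle collapse into one. Comparing the unique counts before and after, as done above, is a reasonable sanity check.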

It is possible that similar encoding problems are present in the companies file as well. Let's look at the companies which are present in the companies file but not in the rounds file. If these have special characters, it is most likely because the companies file is encoded (while rounds is not).

In [None]:
# companies present in companies df but not in rounds df
companies[~companies['permalink'].isin(rounds['company_permalink'])]

In [None]:
# quickly verify that there are 66368 unique companies in both
# and that only the same 66368 are present in both files

# unique values
print(len(companies.permalink.unique()))
print(len(rounds.company_permalink.unique()))

# rows present in rounds but not in companies, and vice versa (both should be 0)
print(len(rounds.loc[~rounds['company_permalink'].isin(companies['permalink']), :]))
print(len(companies[~companies['permalink'].isin(rounds['company_permalink'])]))

In [None]:
# missing values in companies df
companies.isnull().sum()

In [None]:
# missing values in rounds df
rounds.isnull().sum()

Since there are no missing values in the permalink or company_permalink columns, let's merge the two and then work on the master dataframe.
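Before merging, it can help to let pandas verify the key relationship itself. The sketch below runs on made-up stand-ins for the two tables and uses the `validate` and `indicator` parameters of `pd.merge`; the outer join is used here only so that unmatched rows from either side would become visible.

```python
import pandas as pd

# Toy stand-ins for the two tables (values are made up)
toy_companies = pd.DataFrame({
    "permalink": ["/organization/a", "/organization/b"],
    "name": ["A", "B"],
})
toy_rounds = pd.DataFrame({
    "company_permalink": ["/organization/a", "/organization/a", "/organization/b"],
    "raised_amount_usd": [1e6, 5e6, 2e6],
})

# validate="one_to_many" raises MergeError if permalink is duplicated on the left side;
# indicator=True adds a _merge column showing which side each row came from
merged = pd.merge(
    toy_companies, toy_rounds, how="outer",
    left_on="permalink", right_on="company_permalink",
    validate="one_to_many", indicator=True,
)

unmatched = merged[merged["_merge"] != "both"]
print(len(merged), len(unmatched))
```

An empty `unmatched` frame means an inner join and an outer join would return the same rows.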

In [None]:
# merging the two dfs
master = pd.merge(companies, rounds, how="inner", left_on="permalink", right_on="company_permalink")
master.head()

In [None]:
# removing redundant columns
master =  master.drop(['company_permalink'], axis=1) 

In [None]:
# summing up the missing values (column-wise) and displaying fraction of NaNs
round(100*(master.isnull().sum()/len(master.index)), 2)

Clearly, the column funding_round_code is useless (with about 73% missing values). Also, for the business objectives given, the columns homepage_url, founded_at, state_code, region and city need not be used.
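The percentage of missing values per column can also be computed with `isnull().mean()`, which makes it easy to pick out mostly empty columns programmatically. This is a sketch on toy data; the 70% threshold is an assumption chosen for illustration, not a rule from the analysis.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values
toy = pd.DataFrame({
    "funding_round_code": ["A", np.nan, np.nan, np.nan],
    "raised_amount_usd": [1.0, 2.0, np.nan, 4.0],
    "country_code": ["USA", "GBR", "IND", "USA"],
})

# The mean of a boolean mask is the fraction of True values, i.e. the fraction missing
missing_pct = (toy.isnull().mean() * 100).round(2)

threshold = 70  # hypothetical cut-off for "too sparse to be useful"
mostly_empty = missing_pct[missing_pct > threshold].index.tolist()

print(missing_pct)
print(mostly_empty)
```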

In [None]:
# dropping columns 
master = master.drop(['funding_round_code', 'homepage_url', 'founded_at', 'state_code', 'region', 'city'], axis=1)
master.head()

In [None]:
# summing up the missing values (column-wise) and displaying fraction of NaNs
round(100*(master.isnull().sum()/len(master.index)), 2)

Note that the column raised_amount_usd is an important column, since that is the number we want to analyse (compare, means, sum etc.). That needs to be carefully treated.

Also, the column country_code will be used for country-wise analysis, and category_list will be used to merge the dataframe with the main categories.

Let's first see how we can deal with missing values in raised_amount_usd

In [None]:
# summary stats of raised_amount_usd
master['raised_amount_usd'].describe()

The mean is somewhere around USD 10 million, while the median is only about USD 1m. The min and max values are also miles apart.

In general, since there is a huge spread in the funding amounts, it will be inappropriate to impute it with a metric such as median or mean. Also, since we have quite a large number of observations, it is wiser to just drop the rows.

Let's thus remove the rows having NaNs in raised_amount_usd.
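The effect of imputing a heavily skewed column can be seen on a handful of made-up amounts. In this sketch, filling the gaps with the mean shifts the median, while dropping the rows leaves it unchanged.

```python
import numpy as np
import pandas as pd

# Made-up funding amounts in USD, with one very large round and two missing values
amounts = pd.Series([0.5e6, 1e6, 1e6, 2e6, 500e6, np.nan, np.nan])

mean_value = amounts.mean()      # pulled up by the single very large round
median_value = amounts.median()

# Option 1: impute with the mean; option 2: drop the missing rows
median_after_impute = amounts.fillna(mean_value).median()
median_after_drop = amounts.dropna().median()

print(mean_value, median_value)
print(median_after_impute, median_after_drop)
```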

In [None]:
# removing NaNs in raised_amount_usd
master = master[~np.isnan(master['raised_amount_usd'])]
round(100*(master.isnull().sum()/len(master.index)), 2)

Let's now look at the column country_code. To see the distribution of the values for categorical variables, it is best to convert them into type 'category'.

In [None]:
country_codes = master['country_code'].astype('category')

In [None]:
# displaying frequencies of each category
country_codes.value_counts()

In [None]:
# viewing fractions of counts of country_codes
100*(master['country_code'].value_counts()/len(master.index))

Now, we can either delete the rows with a missing country_code (about 6% of rows), or we can impute them with USA. Since 6% is quite small, and we have a decent amount of data, it may be better to just remove the rows.

Note that np.isnan does not work with arrays of type 'object', it only works with native numpy type (float). Thus, you can use pd.isnull() instead.
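The difference can be checked on a small series of strings with one missing value (toy data). The array is converted to `object` dtype explicitly so the behaviour does not depend on the pandas version's default string type.

```python
import numpy as np
import pandas as pd

# Strings plus a missing value
codes = pd.Series(["USA", np.nan, "GBR"])

# np.isnan is a numeric ufunc and raises TypeError on an object array
try:
    np.isnan(codes.to_numpy(dtype=object))
    isnan_worked = True
except TypeError:
    isnan_worked = False

# pd.isnull handles object data and returns a boolean mask
mask = pd.isnull(codes)

print(isnan_worked)
print(mask.tolist())
```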

In [None]:
# removing rows with missing country_codes
master = master[~pd.isnull(master['country_code'])]

# look at missing values
round(100*(master.isnull().sum()/len(master.index)), 2)

Note that the fraction of missing values in the remaining dataframe has also reduced now - only 0.65% in category_list. Let's thus remove those as well.

Note: Optionally, you could simply have left the missing values in the dataset and continued the analysis. There is nothing wrong with that. But in this case, since we will use that column later for merging with the 'main_categories', removing the missing values will be quite convenient (and again, we have enough data).

In [None]:
# removing rows with missing category_list values
master = master[~pd.isnull(master['category_list'])]

# look at missing values
round(100*(master.isnull().sum()/len(master.index)), 2)

In [None]:
master.info()

In [None]:
# Now the data looks nice and clean, let's proceed with the analysis.

# Data Analysis

## Funding Type Analysis
This is the first of the three goals of data analysis – investment type analysis. 

The funding types such as seed, venture, angel, etc. depend on the type of the company (startup, corporate, etc.), its stage (early stage startup, funded startup, etc.), the amount of funding (a few million USD to a billion USD), and so on. For example, seed, angel and venture are three common stages of startup funding.

Seed/angel funding refers to early-stage startups whereas venture funding occurs after seed or angel stage/s and involves a relatively higher amount of investment. This means that if a company has reached the venture stage, it would have already passed through the angel or seed stage/s.
Private equity type investments are associated with much larger companies and involve much higher investments than venture type. Startups which have grown in scale may also receive private equity funding.

Spark Funds wants to choose one of these four investment types for each potential investment they will make. 

Considering the constraints of Spark Funds, you have to decide one funding type which is most suitable for them.

1. Calculate the average investment amount for each of the four funding types (venture, angel, seed, and private equity) and report the answers in Table 2.1

2. Based on the average investment amount calculated above, which investment type do you think is the most suitable for Spark Funds?

In [None]:
# first, let's filter the df so it only contains the four specified funding types
# (.copy() gives an independent frame, so later column assignments do not touch master)
funding_types = ["venture", "angel", "seed", "private_equity"]
df = master[master.funding_round_type.isin(funding_types)].copy()
df.head()


Now, we have to compute a representative value of the funding amount for each type of investment. We can either choose the mean or the median - let's have a look at the distribution of raised_amount_usd to get a sense of the distribution of data.

In [None]:
# distribution of raised_amount_usd
plt1 = sns.boxplot(y=df['raised_amount_usd'])
plt.yscale('log')
plt1.set(ylabel = 'Funding ($)')
plt.tight_layout()
plt.show()

In [None]:
# First, let's convert the funding raised to million USD
df['raised_amount_usd'] = round(df['raised_amount_usd']/1000000,2)

In [None]:
# summary metrics
df['raised_amount_usd'].describe()

Note that there's a significant difference between the mean and the median - USD 9.5m and USD 2m. Let's also compare the summary stats across the four categories.

In [None]:
# comparing summary stats across four categories
sns.boxplot(x='funding_round_type', y='raised_amount_usd', data=df)
plt.yscale('log')
plt.show()

In [None]:
# compare the mean and median values across categories
df.pivot_table(values='raised_amount_usd', columns='funding_round_type', aggfunc=['median', 'mean'])

Note that there's a large difference between the mean and the median values for all four types. For type private_equity, for example, the median is about 20m while the mean is about 70m.

Thus, the choice of the summary statistic will drastically affect the decision (of the investment type). Let's choose median, since there are quite a few extreme values pulling the mean up towards them - but they are not the most 'representative' values.

In [None]:
# compare the median investment amount across the types
df.groupby('funding_round_type')['raised_amount_usd'].median().sort_values(ascending=False)

The median investment amount for type 'private_equity' is approx. USD 20m, which is beyond Spark Funds' range of 5-15m. The median of 'venture' type is about USD 5m, which is suitable for them. The median amounts of angel and seed types are lower than their range.

Thus, 'venture' type investment will be most suited to them.
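The selection rule used above (keep the funding type whose median falls inside the 5-15m range) can be sketched on made-up amounts. The numbers below are illustrative only; they are chosen so that a single large venture round pushes the mean out of range while the median stays inside it.

```python
import pandas as pd

# Made-up amounts in million USD
toy = pd.DataFrame({
    "funding_round_type": ["seed", "seed", "angel", "angel",
                           "venture", "venture", "venture",
                           "private_equity", "private_equity"],
    "raised_amount_usd": [0.3, 0.5, 0.2, 0.6, 4.0, 6.0, 40.0, 18.0, 90.0],
})

grouped = toy.groupby("funding_round_type")["raised_amount_usd"]
medians = grouped.median()
means = grouped.mean()

# Funding types whose representative value lies within Spark Funds' range
suitable_by_median = medians[medians.between(5, 15)].index.tolist()
suitable_by_mean = means[means.between(5, 15)].index.tolist()

print(suitable_by_median)
print(suitable_by_mean)
```

With these toy numbers the two statistics lead to different answers, which is why the choice of summary statistic matters.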

## Country Analysis

This is the second goal of analysis — country analysis. 

Now that you know the type of investment suited for Spark Funds, let's narrow down the countries. 

Spark Funds wants to invest in countries with the highest amount of funding for the chosen investment type. This is a part of its broader strategy to invest where most investments are occurring. 

Spark Funds wants to see the top nine countries which have received the highest total funding (across ALL sectors for the chosen investment type)

For the chosen investment type, make a data frame named top9 with the top nine countries (based on the total investment amount each country has received) 

Identify the top three English-speaking countries in the data frame top9.

In [None]:
# filter the df for venture type investments
df = df[df.funding_round_type=="venture"]

# group by country codes and compare the total funding amounts
country_wise_total = df.groupby('country_code')['raised_amount_usd'].sum().sort_values(ascending=False)
country_wise_total[:9]

Among the top 9 countries, USA, GBR and IND are the top three English speaking countries. Let's filter the dataframe so it contains only the top 3 countries.
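The top-nine and English-speaking steps can be sketched as follows. The totals are made up, and the `english_speaking` set is a hand-maintained assumption for illustration; the dataset itself does not carry language information.

```python
import pandas as pd

# Made-up total funding per country (million USD); the ranking is illustrative only
totals = pd.Series({
    "USA": 400.0, "CHN": 40.0, "GBR": 20.0, "IND": 14.0, "CAN": 9.0,
    "FRA": 7.0, "ISR": 6.0, "DEU": 6.5, "JPN": 3.0, "SWE": 2.0,
})

# nlargest returns the nine largest totals, sorted in descending order
top9 = totals.nlargest(9)

# Assumption: a manually curated set of country codes where English is an official language
english_speaking = {"USA", "GBR", "IND", "CAN", "AUS"}

# Keep the ranking order of top9 and take the first three English-speaking entries
top3_english = [code for code in top9.index if code in english_speaking][:3]

print(top9.index.tolist())
print(top3_english)
```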

In [None]:
# filtering for the top three countries
df = df[(df.country_code=='USA') | (df.country_code=='GBR') | (df.country_code=='IND')]
df.head()

In [None]:
# boxplot to see distributions of funding amount across countries
plt.figure(figsize=(10, 10))
sns.boxplot(x='country_code', y='raised_amount_usd', data=df)
plt.yscale('log')
plt.show()

## Sector Analysis
This is the third goal of analysis — sector analysis. 

When we say sector analysis, we refer to one of the eight main sectors (named main_sector) listed in the mapping file (note that ‘Others’ is one of the eight main sectors). This is to simplify the analysis by grouping the numerous category lists (named ‘category_list’) in the mapping file. For example, in the mapping file, category_lists such as ‘3D’, ‘3D Printing’, ‘3D Technology’, etc. are mapped to the main sector ‘Manufacturing’. 

Also, for some companies, the category list is a list of multiple sub-sectors separated by a pipe (vertical bar |). For example, one of the companies’ category_list is Application Platforms|Real Time|Social Network Media. 

You discuss with the CEO and come up with the business rule that the first string before the vertical bar will be considered the primary sector. In the example above, ‘Application Platforms’ will be considered the primary sector.

Extract the primary sector of each category list from the category_list column

Use the mapping file 'mapping.csv' to map each primary sector to one of the eight main sectors (Note that ‘Others’ is also considered one of the main sectors)
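The two steps above can be sketched on toy data. The wide layout assumed here (one row per category, one 0/1 column per main sector) mirrors how the mapping is reshaped in this notebook; the assignment of 'Application Platforms' to 'Others' is hypothetical, and `primary_sector` is an illustrative column name.

```python
import pandas as pd

# Toy companies: one multi-category list and one single category
toy = pd.DataFrame({
    "category_list": ["Application Platforms|Real Time|Social Network Media", "3D Printing"]
})

# Business rule: the first entry before the vertical bar is the primary sector
toy["primary_sector"] = toy["category_list"].str.split("|").str[0]

# Toy mapping in wide form; only two of the eight sector columns are shown
wide = pd.DataFrame({
    "category_list": ["Application Platforms", "3D Printing"],
    "Manufacturing": [0, 1],
    "Others": [1, 0],  # hypothetical assignment, for illustration only
})

# Wide to long: one row per (category, main_sector) pair, keeping only the flagged pairs
long_map = (
    wide.melt(id_vars="category_list", var_name="main_sector")
        .query("value == 1")
        .drop(columns="value")
)

result = toy.merge(long_map, left_on="primary_sector", right_on="category_list",
                   suffixes=("", "_map"))
sector_by_primary = result.set_index("primary_sector")["main_sector"].to_dict()
print(sector_by_primary)
```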

In [None]:
df["category_list"] = df["category_list"].str.split("|").str.get(0)
df.head()

In [None]:
mapping_table = pd.read_csv("../input/mapping.csv")
mapping_table.head()

In [None]:
# Code for a merged data frame with each primary sector mapped to its main sector
# (the primary sector should be present in a separate column).
long_map = pd.melt(mapping_table, id_vars=['category_list'], var_name='main_sector')
long_map = long_map[long_map['value']==1]
long_map = long_map.drop('value', axis=1)
long_map.head()

In [None]:
df = pd.merge(df, long_map, on = 'category_list' , how = 'inner')
df.head()

In [None]:
df.info()

## Sector Analysis
Now you have a data frame with each company’s main sector (main_sector) mapped to it. When we say sector analysis, we refer to one of the eight main sectors.

Also, you know the top three English speaking countries and the most suitable funding type for Spark Funds. Let’s call the three countries 'Country 1', 'Country 2' and 'Country 3' and the funding type 'FT'. 

Also, the range of funding preferred by Spark Funds is 5 to 15 million USD. 

Now, the aim is to find out the most heavily invested main sectors in each of the three countries (for funding type FT and investments range of 5-15 M USD).

Create three separate data frames D1, D2 and D3 for each of the three countries containing the observations of funding type FT falling within the 5-15 million USD range. The three data frames should contain:

All the columns of the master_frame along with the primary sector and the main sector

The total number (or count) of investments for each main sector in a separate column

The total amount invested in each main sector in a separate column

Using the three data frames, you can calculate the total number and amount of investments in each main sector.
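One way to attach the per-sector count and total as separate columns is `groupby(...).transform`, which returns a result aligned with the original rows. This is a sketch on toy data; `with_sector_totals`, `sector_count` and `sector_total` are hypothetical names introduced for illustration.

```python
import pandas as pd

# Toy investments in million USD, already restricted to the 5-15 range
toy = pd.DataFrame({
    "country_code": ["USA", "USA", "USA", "GBR"],
    "main_sector": ["Others", "Others", "Manufacturing", "Others"],
    "raised_amount_usd": [5.0, 10.0, 7.5, 6.0],
})

def with_sector_totals(frame, country):
    """Return the rows for one country plus per-sector count and sum columns."""
    out = frame[frame["country_code"] == country].copy()
    grouped = out.groupby("main_sector")["raised_amount_usd"]
    out["sector_count"] = grouped.transform("count")
    out["sector_total"] = grouped.transform("sum")
    return out

D1_toy = with_sector_totals(toy, "USA")
print(D1_toy)
```

Because `transform` keeps one value per original row, every investment carries the count and total of its own main sector, so all original columns are preserved alongside the two new ones.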

In [None]:
# summarising the sector-wise number and sum of venture investments across three countries

# first, let's also filter for investment range between 5 and 15m
df = df[(df['raised_amount_usd'] >= 5) & (df['raised_amount_usd'] <= 15)]
df.head()

In [None]:
# First English-speaking country 'USA' for funding type venture
D1 = df[df.country_code == 'USA']
# Second English-speaking country 'Great Britain' for funding type venture
D2 = df[df.country_code == 'GBR']
# Third English-speaking country 'India' for funding type venture
D3 = df[df.country_code == 'IND']

In [None]:
# groupby country, sector and compute the count and sum
df.groupby(['country_code', 'main_sector']).raised_amount_usd.agg(['count', 'sum'])

In [None]:
# plotting sector-wise count and sum of investments in the three countries
plt.figure(figsize=(16, 14))

plt.subplot(2, 1, 1)
p = sns.barplot(x='main_sector', y='raised_amount_usd', hue='country_code', data=df, estimator=np.sum)
p.set_xticklabels(p.get_xticklabels(),rotation=30)
plt.title('Total Invested Amount (USD)')

plt.subplot(2, 1, 2)
q = sns.countplot(x='main_sector', hue='country_code', data=df)
q.set_xticklabels(q.get_xticklabels(),rotation=30)
plt.title('Number of Investments')

plt.tight_layout()
plt.show()

# __<font color=green>FINISH!!!</font>__😎😎😎😎
