## Demo of the CrunchBase data

This notebook highlights key aspects of the cb+ dataset, which enriches CrunchBase (CB) company data with sector information derived from a clustering and supervised machine learning analysis.

## Preamble

In [None]:
%run ../notebook_preamble.ipy

In [None]:
import seaborn as sns

In [None]:
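def drop_diagonal(corr):
    '''
    Sets the diagonal of a correlation (or similarity) matrix to zero so that
    self-correlations do not dominate the colour scale when we plot it as a heatmap

    Args:
        corr is a square dataframe with matching index and columns

    '''

    #np.array copies the values, so the input dataframe is left untouched
    sector_corr_array = np.array(corr)

    np.fill_diagonal(sector_corr_array,0)

    out = pd.DataFrame(sector_corr_array,index=corr.index,columns=corr.columns)

    return out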
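
As a quick illustration of what `drop_diagonal` does, the sketch below applies the same logic to a toy correlation matrix. The data and the `toy_*` and `drop_diagonal_demo` names are made up for this example, and numpy and pandas are imported explicitly rather than through the preamble.

```python
import numpy as np
import pandas as pd

def drop_diagonal_demo(corr):
    # Same logic as drop_diagonal above: copy the values, then zero the diagonal
    arr = np.array(corr, dtype=float)
    np.fill_diagonal(arr, 0)
    return pd.DataFrame(arr, index=corr.index, columns=corr.columns)

# Toy data (hypothetical): three made-up score columns
toy = pd.DataFrame({'a': [1.0, 2.0, 3.0, 4.0],
                    'b': [2.0, 1.0, 4.0, 3.0],
                    'c': [4.0, 3.0, 2.0, 1.0]})

toy_corr = toy.corr()                        # diagonal is 1 by construction
toy_no_diag = drop_diagonal_demo(toy_corr)   # diagonal is now 0

print(toy_no_diag)
```

Without this step the diagonal of ones would sit at the top of the colour scale and flatten the contrast between the off-diagonal values we actually care about.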

## Load data

In [None]:
cb_meta = pd.read_csv('../../data/processed/17_9_2019_predicted_metadata.csv',compression='zip')

cb_labels = pd.read_csv('../../data/processed/17_9_2019_predicted_sectors.csv',compression='zip')

## Showcase

### `cb_meta` 

`cb_meta` contains metadata about CrunchBase companies for which we have predicted labels.

Some observations:

* This only includes organisations in the company role
* This only includes organisations with long descriptions


In [None]:
cb_meta.head()

In [None]:
cb_meta.shape

### `cb_labels` 

`cb_labels` contains predicted probabilities for all the sectors we are studying (61 sectors, based on a clustering analysis carried out in `1_jmg_load`).

In [None]:
cb_labels.head()

In [None]:
#Remove the unnamed first column (visible in the head() output above)
cb_labels = cb_labels.iloc[:,1:]

sectors = cb_labels.columns

In [None]:
cb_labels.shape

#### Combine them into `cb_combi`

In [None]:
cb_combi = pd.concat([cb_meta,cb_labels],axis=1)

#### Trends

In [None]:
#First we need to create a year variable (the first component of founded_on)
cb_combi['year']= [int(x.split('-')[0]) if pd.notnull(x) else np.nan for x in cb_combi['founded_on']]

**Number of companies**

In [None]:
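#Trends: companies founded per year. We reindex over the full range of years so that
#years without companies appear as 0 (.loc raises a KeyError for missing labels in recent pandas)
year_counts = cb_combi['year'].value_counts()
year_counts.reindex(np.arange(cb_combi['year'].min(),cb_combi['year'].max()+1)).fillna(0).plot()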

CrunchBase includes data about very old companies

In [None]:
cb_meta.columns

**Funding**

In [None]:
cb_combi.groupby('year')['funding_total_usd'].sum().plot()

Note that this captures total funding by the *year in which a company was founded*, not the year in which the funding was raised.
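
To make the point about founding-year cohorts concrete, here is a minimal sketch on toy data. The rows and values are invented; only the column names `founded_on` and `funding_total_usd` come from the notebook. Each company's total funding is attributed to the year it was founded, and companies without a founding date drop out of the grouped total.

```python
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({
    'founded_on': ['2010-03-01', '2010-11-15', '2015-06-30', None],
    'funding_total_usd': [1e6, 2e6, 5e6, 3e6],
})

# Founding year; missing dates become NaN
toy['year'] = pd.to_datetime(toy['founded_on']).dt.year

# Total funding per founding-year cohort (groupby skips NaN years by default)
funding_by_cohort = toy.groupby('year')['funding_total_usd'].sum()

print(funding_by_cohort)
```

A company founded in 2010 that raised most of its money in 2018 still contributes all of it to the 2010 cohort, so the chart should not be read as funding activity over time.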

#### Geographies

In [None]:
top_countries = cb_combi['country'].value_counts(normalize=True)[:20].index

In [None]:
cb_combi['country'].value_counts(normalize=True)[:20].plot.bar()

Note that this country variable is based on Nesta's own geocoding

In [None]:
ax = (cb_combi.groupby('country')['funding_total_usd'].sum().sort_values(ascending=False)/1e9)[:20].plot.bar()

ax.set_ylabel('Billion $')

In [None]:
ax = pd.crosstab(cb_combi['year'],cb_combi['country'],normalize=0).loc[np.arange(2000,2019),top_countries[:10]].rolling(3).mean().dropna().plot.bar(stacked=True,width=0.9)
ax.legend(bbox_to_anchor=(1,1))

ax.set_ylabel('Share of all activity accounted for by country')

#### Sectors

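We label each company with its top sector, i.e. the sector with the highest predicted probability. We also create a variable that only assigns a company to that sector if the probability is above 0.75; all other companies are labelled 'mixed'.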

In [None]:
#Focus on dominant sector

cb_combi['dominant_sector']= cb_combi[sectors].max(axis=1)>0.75

In [None]:
cb_combi['sector_top'] = cb_combi[sectors].idxmax(axis=1)

In [None]:
cb_combi['sector_dom'] = cb_combi['sector_top'].where(cb_combi['dominant_sector'],'mixed')
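
The labelling logic can be checked on a small toy table. The sector names and probabilities below are invented for illustration; the 0.75 threshold is the one used in this notebook.

```python
import pandas as pd

# Toy predicted probabilities (hypothetical values and sector names)
toy_labels = pd.DataFrame({
    'health': [0.90, 0.40, 0.10],
    'software': [0.05, 0.35, 0.80],
    'energy': [0.05, 0.25, 0.10],
})

threshold = 0.75  # same cut-off as in the text

top = toy_labels.idxmax(axis=1)                    # highest-probability sector per company
is_dominant = toy_labels.max(axis=1) > threshold   # does the top sector clear the cut-off?
sector_dom_toy = top.where(is_dominant, 'mixed')   # otherwise label the company 'mixed'

print(sector_dom_toy.tolist())
# → ['health', 'mixed', 'software']
```

The second company has a top sector (health, at 0.40) but no sector dominates, so it ends up in the 'mixed' group.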

In [None]:
cb_combi['sector_dom'].value_counts().head()

Although it is somewhat surprising to find health as the largest vertical, we assume this is at least partly because health is a more aggregated category than, e.g., software.

#### Random check of results

In [None]:
import random

In [None]:
def get_example(df,number,length):
    '''
    Prints random examples of the long_description field
    
    Args:
        df is the dataframe we want to use
        number is the number of examples we want
        length is the number of characters to print from each example
    
    '''
    
    #Cap the sample size so that small (or empty) groups do not raise a ValueError
    choose = random.sample(list(df.index),min(number,len(df)))
    
    for x in df.loc[choose,'long_description']:
        
        print(x[:length])
        print('\n')

In [None]:
for x in sectors:
    
    print(x)
    print('===')
    
    in_sector = cb_combi.loc[cb_combi['sector_dom']==x]
    
    get_example(in_sector,3,700)
    
    print('\n')

In [None]:
sector_trends = pd.crosstab(cb_combi['sector_dom'],cb_combi['year'],normalize=0).loc[:,np.arange(2005,2019)].sort_values(2018,ascending=False)

fig,ax = plt.subplots(figsize=(7,15))

sns.heatmap(sector_trends,cmap='bwr',ax=ax)

**Funding trends**

In [None]:
funding_sector_trends= cb_combi.groupby(
    ['year','sector_dom'])['funding_total_usd'].sum().reset_index(drop=False).pivot_table(index='sector_dom',columns='year',values='funding_total_usd').fillna(0)

funding_sector_trend_norm = funding_sector_trends.apply(lambda x: x/x.sum(),axis=1)

fig,ax = plt.subplots(figsize=(7,18))

sns.heatmap(funding_sector_trend_norm.sort_values(2018,ascending=False).loc[:,np.arange(2008,2019)],ax=ax,cmap='bwr')

#### Sector clusters

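We want a cheap visualisation of how sectors relate to each other, short of drawing a full network. We use pairwise cosine similarities between the sectors' predicted probabilities and seaborn's `clustermap`.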

In [None]:
from sklearn.metrics import pairwise_distances

In [None]:
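#Alternative (not used): Jaccard similarity between sectors after binarising probabilities at 0.5
#sims = drop_diagonal(pd.DataFrame(1-pairwise_distances(np.array(cb_combi[sectors].applymap(lambda x: 1 if x>0.5 else 0)).T,metric='jaccard'),index=sectors,columns=sectors))

#Cosine similarity between sectors (1 - cosine distance), with the diagonal zeroed for plotting
sims = drop_diagonal(pd.DataFrame(1-pairwise_distances(cb_combi[sectors].T,metric='cosine'),index=sectors,columns=sectors))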
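
To show what the similarity matrix measures, the sketch below computes cosine similarity between sector columns on toy data. It uses plain numpy instead of `pairwise_distances`, which should give the same result as `1 - cosine distance`. The sector names and probabilities are invented.

```python
import numpy as np
import pandas as pd

# Toy probabilities (hypothetical): rows are companies, columns are sectors
toy = pd.DataFrame({
    'health': [0.9, 0.8, 0.1, 0.0],
    'biotech': [0.8, 0.9, 0.0, 0.1],
    'fintech': [0.0, 0.1, 0.9, 0.8],
})

X = toy.to_numpy()
unit = X / np.linalg.norm(X, axis=0)   # scale each sector column to unit length
sim_values = unit.T @ unit             # cosine similarity between every pair of sectors

# Zero the diagonal so self-similarity does not dominate the colour scale
np.fill_diagonal(sim_values, 0)

toy_sims = pd.DataFrame(sim_values, index=toy.columns, columns=toy.columns)

print(toy_sims.round(2))
```

Sectors that tend to receive high probabilities for the same companies (health and biotech here) come out as similar, while sectors that rarely co-occur (health and fintech) score close to zero.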

In [None]:
sns.clustermap(sims,cmap='bwr',figsize=(16,16))

In [None]:
from data_getters.labs.core import upload_file

In [None]:
cb_combi.to_csv(f'../../data/processed/{today_str}_cb_sector_labelled.csv')