## Create classes

This is the first notebook to run when starting TechRank. Here, we load the data and create the classes for companies and technologies.

Then we save the results as two dictionaries (company_name:class_company and tech_name:class_tech), which contain all the information needed for the next steps.
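
To make the shape of these two dictionaries concrete, here is a minimal sketch. The `CompanySketch` and `TechnologySketch` dataclasses and the `build_dicts` helper are hypothetical stand-ins, not the actual definitions in `classes.py`, and the attributes they carry are assumptions. The two sample rows are taken from the cleaned data shown later in this notebook.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class CompanySketch:
    """Hypothetical stand-in for the company class (not the one in classes.py)."""
    name: str
    technologies: List[str] = field(default_factory=list)
    rank: float = 0.0


@dataclass
class TechnologySketch:
    """Hypothetical stand-in for the technology class (not the one in classes.py)."""
    name: str


def build_dicts(rows):
    """Build the two name -> object dictionaries from (name, rank, technologies) rows."""
    dict_companies: Dict[str, CompanySketch] = {}
    dict_tech: Dict[str, TechnologySketch] = {}
    for name, rank, techs in rows:
        dict_companies[name] = CompanySketch(name=name, technologies=list(techs), rank=rank)
        for tech in techs:
            # Each technology is stored once, even if several companies share it
            dict_tech.setdefault(tech, TechnologySketch(name=tech))
    return dict_companies, dict_tech


rows = [
    ("Critical Force", 13752.0, ["Gaming", "Mobile Devices", "Video Games"]),
    ("ALBERT", 15806.0, ["Analytics", "Database"]),
]
dict_companies, dict_tech = build_dicts(rows)
print(f"We have {len(dict_companies)} companies and {len(dict_tech)} technologies")
```

Keying both dictionaries by name keeps lookups simple, at the cost of assuming that names do not collide.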

### Table of contents:

* [Download data from CSV](#down)
* [Data cleaning](#cleaning)
* [Select companies in cybersecurity](#cyber)
* [Create graph and dictionaries](#create_graph)
* [Save graph and dictionaries](#save)
* [Quick loop](#loop0)

In [1]:
flag_cybersecurity = True

In [2]:
import math
import arrow
import ipynb 
import os.path

import json
import pickle
import sys
import random
import operator

import pandas as pd
import networkx as nx
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as ss
import numpy as np

from dotenv import load_dotenv
from networkx.algorithms.bipartite.matrix import biadjacency_matrix
from networkx.algorithms import bipartite
from importlib import reload
from typing import List


In [3]:
# import functions from py file 

import functions.fun
reload(functions.fun)
from functions.fun import CB_data_cleaning, df_from_api_CB, extract_nodes, extract_data_from_column, field_extraction
from functions.fun import nx_dip_graph_from_pandas, plot_bipartite_graph, filter_dict, check_desc
from functions.fun import extract_classes_company_tech, degree_bip, insert_data_classes

In [4]:
# import functions from py file 

import functions.fun_meth_reflections
reload(functions.fun_meth_reflections)
from functions.fun_meth_reflections import zero_order_score, Gct_beta, Gtc_alpha, make_G_hat, next_order_score, generator_order_w
from functions.fun_meth_reflections import M_test_triangular, w_stream, find_convergence, rank_df_class, w_star_analytic

In [6]:
# import classes 

import classes
reload(classes)

<module 'classes' from 'c:\\Users\\tjga9\\Documents\\Tomas\\EPFL\\MA3\\CYD PDS\\Code\\TechRank\\5-TechRank-main\\5-TechRank-main\\classes.py'>

## Download data from CSV <a class="anchor" id="down"></a>

In [38]:
df_start = pd.read_csv(r"C:\Users\tjga9\Documents\Tomas\EPFL\MA3\CYD PDS\Code\TechRank\5-TechRank-main\5-TechRank-main\data\sample CB data\organizations.csv")

df_start.head()
# 1. Check immediately after loading the data
print(f"Right after loading: {len(df_start)} rows")

Right after loading: 50 rows


In [8]:
df_start.columns

Index(['uuid', 'name', 'type', 'permalink', 'cb_url', 'rank', 'created_at',
       'updated_at', 'legal_name', 'roles', 'domain', 'homepage_url',
       'country_code', 'state_code', 'region', 'city', 'address',
       'postal_code', 'status', 'short_description', 'category_list',
       'category_groups_list', 'num_funding_rounds', 'total_funding_usd',
       'total_funding', 'total_funding_currency_code', 'founded_on',
       'last_funding_on', 'closed_on', 'employee_count', 'email', 'phone',
       'facebook_url', 'linkedin_url', 'twitter_url', 'logo_url', 'alias1',
       'alias2', 'alias3', 'primary_role', 'num_exits', 'revenue_range'],
      dtype='object')

## Data Cleaning <a class="anchor" id="cleaning"></a>

We use the company name as the key. In the future, it would be better to use the uuid.

- `df_start`: dataset before cleaning
- `df`: dataset after cleaning
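
The internals of `CB_data_cleaning` are not shown in this notebook. As a rough illustration of the step from `df_start` to `df`, here is a minimal sketch that assumes the cleaning drops columns, renames columns, removes rows with missing values, and sorts, in that order. The function `clean_sketch` and the toy DataFrame are hypothetical and are not the project's actual code or data.

```python
import pandas as pd


def clean_sketch(df_start, to_drop, to_rename, drop_if_nan, sort_by):
    """Hypothetical cleaning pipeline; NOT the actual CB_data_cleaning."""
    df = df_start.drop(columns=to_drop, errors="ignore")  # skip columns that are absent
    df = df.rename(columns=to_rename)
    df = df.dropna(subset=drop_if_nan)  # drop rows missing any required field
    df = df.sort_values(by=sort_by)
    return df


# Toy data, invented for illustration only
toy = pd.DataFrame({
    "name": ["A", "B", "C"],
    "rank": [30.0, 10.0, None],
    "category_list": ["Software", "Gaming,Mobile", "Health Care"],
    "short_description": ["a", "b", "c"],
    "phone": ["1", "2", "3"],
})

cleaned = clean_sketch(
    toy,
    to_drop=["phone"],
    to_rename={"category_list": "category_groups"},
    drop_if_nan=["category_groups", "rank", "short_description"],
    sort_by="rank",
)
print(cleaned[["name", "rank", "category_groups"]])
```

Note that the rename has to happen before the missing-value filter in this sketch, because `drop_if_nan` refers to the new column name `category_groups`.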

In [10]:
# we create the lists needed as input in the function to clean the data

to_drop = [
    'type',
    'permalink',
    'cb_url',   
    'created_at',
    'domain',
    'address',
    'state_code',
    'updated_at',
    'legal_name',
    'roles',
    'postal_code',
    'homepage_url',
    'num_funding_rounds',
    'total_funding_currency_code',
    'phone',
    'email',
    'num_exits',
    'alias2',
    'alias3',
    'logo_url',
    'alias1',
    'last_funding_on',
    'twitter_url',
    'facebook_url'
]

# to_rename = { 'category_groups_list': 'category_groups' }
to_rename = { 'category_list': 'category_groups' }

drop_if_nan = [
    'category_groups',
    'rank',
    'short_description'
]

to_check_double = {}

sort_by = "rank"

In [39]:
# clean data: from df_start to df
df = CB_data_cleaning(df_start, to_drop, to_rename, to_check_double, drop_if_nan, sort_by)
print(f"After cleaning: {len(df)} rows")


After cleaning: 47 rows


In [13]:
# show cleaned dataset
# .head() shows the first 5 rows of the dataframe
df.head()

| | uuid | name | rank | country_code | region | city | status | short_description | category_groups | category_groups_list | total_funding_usd | total_funding | founded_on | closed_on | employee_count | linkedin_url | primary_role | revenue_range |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | 0fbfb7ac-4015-1561-6d42-ec9c4a87a324 | Paladina Health | 7675.0 | USA | Colorado | Denver | acquired | Paladina Health is an innovative employer-spon... | Health Care,Hospital,Medical,Personal Health | Health Care | 165000000.0 | 165000000.0 | 2010-01-01 | | 11-50 | https://www.linkedin.com/company/paladina-health | company | |
| 46 | 17818d55-4f93-94b1-6b80-575a7cdc5878 | Critical Force | 13752.0 | FIN | Oulu | Kajaani | operating | Critical Force is a Finnish video game company | Gaming,Mobile Devices,Video Games | Consumer Electronics,Gaming,Hardware,Mobile | 10760303.0 | 10760300.0 | 2012-01-01 | | 11-50 | https://www.linkedin.com/company/critical-forc... | company | |
| 38 | daa9dd72-86f3-bafd-c9ec-88fb18eeed2a | ALBERT | 15806.0 | JPN | Tokyo | Tokyo | ipo | ALBERT offers businesses with analytics and co... | Analytics,Database | Data and Analytics,Software | 21756468.0 | 2409890000.0 | 2005-07-01 | | unknown | | company | |
| 8 | 0cbe819b-a9c0-d059-b5c3-7e7a859c4ec7 | Trainline Europe | 19943.0 | FRA | Ile-de-France | Paris | acquired | Trainline (formerly Captain Train) sells train... | Internet,Ticketing,Travel | Events,Internet Services,Media and Entertainme... | 11984031.0 | 9400000.0 | 2009-02-08 | | 11-50 | https://www.linkedin.com/showcase/trainline-eu/ | company | |
| 5 | b612bc69-c1dd-a92e-a349-a5b767922df8 | Applied BioCode | 37317.0 | USA | California | Santa Fe Springs | operating | Applied BioCode commercializes a multiplexing ... | Biotechnology,Genetics,Health Diagnostics,Life... | Biotechnology,Health Care,Science and Engineering | 17505680.0 | 17505680.0 | 2008-01-01 | | 11-50 | https://www.linkedin.com/company/appliedbiocode | company | $1M to $10M |


In [40]:
df.columns
print(f"After cleaning: {len(df)} rows")

After cleaning: 47 rows


In [41]:
# Check BEFORE the conversion
print("BEFORE the conversion:")
print(f"Type: {type(df['category_groups'].iloc[0])}")
print(f"Example: {df['category_groups'].iloc[0]}")

# Do the conversion
def convert_to_list(string):
    """Split a comma-separated string into a list of category names."""
    return string.split(",")

if not isinstance(df["category_groups"].iloc[0], list):
    df["category_groups"] = df["category_groups"].apply(convert_to_list)
    print("\n✓ Conversion done")

# Check AFTER the conversion
print("\nAFTER the conversion:")
print(f"Type: {type(df['category_groups'].iloc[0])}")
print(f"Example: {df['category_groups'].iloc[0]}")

# Check that all values are lists
print(f"\nAll values are lists: {df['category_groups'].apply(lambda x: isinstance(x, list)).all()}")

# Check that we still have 47 rows
print(f"Number of rows after conversion: {len(df)}")

BEFORE the conversion:
Type: <class 'str'>
Example: Health Care,Hospital,Medical,Personal Health

✓ Conversion done

AFTER the conversion:
Type: <class 'list'>
Example: ['Health Care', 'Hospital', 'Medical', 'Personal Health']

All values are lists: True
Number of rows after conversion: 47


In [42]:
# convert category_groups to list

def convert_to_list(string):
    li = list(string.split(","))
    return li
  
if type(df["category_groups"][df.index[0]]) != list:
    df["category_groups"] = [convert_to_list(x) for x in df["category_groups"]]

### Select companies in cybersecurity <a class="anchor" id="cyber"></a>


We select only companies that work in the cybersecurity field. The algorithm is easily extended to any other field: we only have to change the words in the _field_words_ list.

Please note that if we also want to select a sub-sample, we have to cut the dataset at this stage (as is done in the quick loop at the end of this notebook).
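
The selection by field words can be sketched as follows. This is only an illustration under assumptions: `filter_by_field_words` is a hypothetical helper, not the project's `field_extraction`, and it assumes a case-insensitive substring match against the joined category list. The toy DataFrame is invented for the example.

```python
import pandas as pd


def filter_by_field_words(df, field_words, column="category_groups"):
    """Keep rows whose category list mentions any field word (hypothetical helper)."""
    words = [w.lower() for w in field_words]

    def matches(categories):
        if not isinstance(categories, list):
            return False
        text = " ".join(categories).lower()
        return any(w in text for w in words)

    # astype(bool) keeps the mask boolean even when the DataFrame is empty
    mask = df[column].apply(matches).astype(bool)
    return df[mask]


# Toy data, invented for illustration only
toy = pd.DataFrame({
    "name": ["A", "B", "C"],
    "category_groups": [
        ["Cyber Security", "Software"],
        ["Gaming", "Video Games"],
        ["Network Security"],
    ],
})

selected = filter_by_field_words(toy, ["cyber", "security"])
print(list(selected["name"]))

if selected.empty:
    # With no match, every later step would work on an empty dataset
    print("No company matched the field words.")
```

Checking `selected.empty` right after the filter is a cheap way to notice that a small sample contains no company from the chosen field before building the graph.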


In [None]:
flag_cybersecurity

True

In [43]:
# inspect types / examples
print(df['category_groups'].apply(lambda x: type(x)).value_counts())
print(df['category_groups'].head(20).tolist())
# count empty / NaN values
print("na:", df['category_groups'].isna().sum())

category_groups
<class 'list'>    47
Name: count, dtype: int64
[['Health Care', 'Hospital', 'Medical', 'Personal Health'], ['Gaming', 'Mobile Devices', 'Video Games'], ['Analytics', 'Database'], ['Internet', 'Ticketing', 'Travel'], ['Biotechnology', 'Genetics', 'Health Diagnostics', 'Life Science'], ['Leisure'], ['Higher Education', 'Nursing and Residential Care', 'Universities'], ['Developer APIs', 'Developer Tools', 'E-Commerce', 'Mobile', 'Mobile Payments', 'Payments'], ['Digital Marketing', 'Marketing', 'Software'], ['Electronics', 'Hardware', 'Manufacturing'], ['Dental', 'Health Care', 'Manufacturing', 'Medical', 'Medical Device'], ['Universities'], ['Internet'], ['Health Care'], ['Food Processing'], ['Electronics', 'Financial Services', 'Payments', 'Sales'], ['Consulting'], ['Billing', 'Finance', 'Mobile Payments'], ['Education', 'Social Media'], ['Communities', 'Developer Tools', 'Enterprise Software', 'Location Based Services', 'Mobile', 'SaaS']]
na: 0


In [44]:

print("before:", df.shape)
df, flag_cybersecurity = field_extraction('cybersecurity', df)
print("after:", df.shape, "flag:", flag_cybersecurity)
print(df.head(10))


before: (47, 18)
after: (0, 18) flag: True
Empty DataFrame
Columns: [uuid, name, rank, country_code, region, city, status, short_description, category_groups, category_groups_list, total_funding_usd, total_funding, founded_on, closed_on, employee_count, linkedin_url, primary_role, revenue_range]
Index: []


In [None]:
keywords = ['cyber', 'security', 'cybersecurity']
mask = df['category_groups'].apply(
    lambda lst: isinstance(lst, list) and any(k.lower() in ' '.join(lst).lower() for k in keywords)
)
print("matched in category_groups:", mask.sum())
print(df[mask].head(10))
# also check the descriptions
desc_mask = df['short_description'].astype(str).str.contains('|'.join(keywords), case=False, na=False)
print("matched in short_description:", desc_mask.sum())
print(df[desc_mask].head(10))

matched in category_groups: 0
Empty DataFrame
Columns: []
Index: []
matched in short_description: 0
Empty DataFrame
Columns: [uuid, name, rank, country_code, region, city, status, short_description, category_groups, category_groups_list, total_funding_usd, total_funding, founded_on, closed_on, employee_count, linkedin_url, primary_role, revenue_range]
Index: []


In [None]:
df.head()

| | uuid | name | rank | country_code | region | city | status | short_description | category_groups | category_groups_list | total_funding_usd | total_funding | founded_on | closed_on | employee_count | linkedin_url | primary_role | revenue_range |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|


In [None]:
df['short_description'].values

array([], dtype=object)

### Create Companies and Technologies classes

#### Ranking

I personally appreciate the ranking that you provide for each company. However, I did not quite understand what's the magic behind it. Is there any chance to get some more insight/details, also considering that we do have an NDA in place?

- Crunchbase rank uses Crunchbase’s intelligent algorithms to score and rank entities (e.g. Company, People, Investors, etc.).
- The algorithms take into account many different variables; ranging from funding events, the entity’s strength of relationships with other entities in the Crunchbase ecosystem, the level of engagement from our website, news articles, and acquisitions.

    - A company’s Rank is fluid and subject to rising and decaying over time with time-sensitive events. Events such as product launches, funding events, leadership changes, and news affect a company’s Crunchbase Rank.


- The Crunchbase rank shows where an entity falls in the Crunchbase database relative to all other entities in that entity type (i.e. if searching for companies, you will see where a specific company ranks relative to all other companies). An entity with a Crunchbase Rank of 1 has the highest rank relative to all other entities of that type.

I would also suggest leveraging our Trend Score - 7 Day, 30 Day, 90 Day (e.g. Company, People, Investors, etc.)

- While Rank shows context, Crunchbase Trend Score demonstrates activity. A company’s rank will change based on activity (fundraising, news, etc.) and Trend Score is an indicator of how much their rank is changing at any given time.
- Crunchbase Trend Score tracks the fluctuations in Rank. As a company’s rank changes, so does its Trend Score.
- Trend Score measures the rate of a company’s activity on a 20-point (+10 <-> -10) scale. Scores closer to +10 mean it’s moving up in rank much faster compared to their peers. Scores closer to -10 mean it’s moving down.
- For example, a company that announces its first funding round will likely experience a jump in its Rank, pushing its Trend Score up as its page views, article counts, funding amount, team members, etc., begin to increase.


## Create graph and dictionaries <a class="anchor" id="create_graph"></a>

In [45]:
# Extract the dictionaries of Companies and Technologies from the dataset and create the network
df_limited = df  # By default, use the whole DataFrame
[dict_companies, dict_tech, B] = extract_classes_company_tech(df_limited)

In [46]:
print(f"We have {len(dict_companies)} companies and {len(dict_tech)} technologies")

We have 0 companies and 0 technologies
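
The extraction itself is done by `extract_classes_company_tech`, imported from `functions.fun`, whose code is not shown here. As a rough idea of what building a company-technology network involves, here is a minimal sketch with `networkx`. The helper `bipartite_sketch` is hypothetical, and the two companies are sample rows from the cleaned data shown earlier, used only for illustration.

```python
import networkx as nx


def bipartite_sketch(company_to_techs):
    """Build a bipartite graph: companies on one side, technologies on the other."""
    B = nx.Graph()
    for company, techs in company_to_techs.items():
        B.add_node(company, bipartite=0)  # 0 marks the company side
        for tech in techs:
            B.add_node(tech, bipartite=1)  # 1 marks the technology side
            B.add_edge(company, tech)
    return B


company_to_techs = {
    "Critical Force": ["Gaming", "Mobile Devices", "Video Games"],
    "ALBERT": ["Analytics", "Database"],
}

B = bipartite_sketch(company_to_techs)
companies = [n for n, d in B.nodes(data=True) if d["bipartite"] == 0]
technologies = [n for n, d in B.nodes(data=True) if d["bipartite"] == 1]
print(f"We have {len(companies)} companies and {len(technologies)} technologies")
```

This sketch assumes that no company shares its name with a technology; if that could happen, the node identifiers would need a prefix or a tuple key to keep the two sides apart.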


## Save dictionaries and network <a class="anchor" id="save"></a>

In [None]:
# Save dictionaries in pickle files

if not flag_cybersecurity:  # all fields
    name_file_com = "savings/classes/dict_companies_" + str(len(dict_companies)) + ".pickle"
    name_file_tech = "savings/classes/dict_tech_" + str(len(dict_tech)) + ".pickle"
else:  # only companies in cybersecurity
    name_file_com = "savings/classes/dict_companies_cybersecurity_" + str(len(dict_companies)) + ".pickle"
    name_file_tech = "savings/classes/dict_tech_cybersecurity_" + str(len(dict_tech)) + ".pickle"

# companies
with open(name_file_com, "wb") as f:
    pickle.dump(dict_companies, f)

#technologies
with open(name_file_tech, "wb") as f:
    pickle.dump(dict_tech, f)

In [None]:
# Save the bipartite graph as gpickle.
# nx.write_gpickle is no longer available, so the graph is saved with pickle instead.

if not flag_cybersecurity:  # all fields
    name_file_graph = 'savings/networks/comp_' + str(len(dict_companies)) + '_tech_' + str(len(dict_tech)) + '.gpickle'
else:  # only companies in cybersecurity
    name_file_graph = 'savings/networks/cybersecurity_comp_' + str(len(dict_companies)) + '_tech_' + str(len(dict_tech)) + '.gpickle'

# Save the graph with pickle
with open(name_file_graph, "wb") as f:
    pickle.dump(B, f)

print(f"Graph saved to {name_file_graph}")

Graph saved to savings/networks/cybersecurity_comp_0_tech_0.gpickle


## Quick loop  <a class="anchor" id="loop0"></a>

By quick loop, we mean repeating all the steps of the previous sections in a single loop, so that the dictionaries are updated for every sample size.

In this part, you won't see many comments because everything has already been explained before :)
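
Since the same file-naming logic appears in every save step, it could be factored into a small helper. The sketch below is a suggestion only: `build_paths`, `save_pickle`, and `load_pickle` are hypothetical names that do not exist in the project, and the example writes to a temporary folder rather than to `savings/`.

```python
import os
import pickle
import tempfile


def build_paths(n_companies, n_tech, cybersecurity, base="savings"):
    """Return file paths that follow the naming scheme used in this notebook."""
    prefix = "cybersecurity_" if cybersecurity else ""
    return {
        "companies": f"{base}/classes/dict_companies_{prefix}{n_companies}.pickle",
        "tech": f"{base}/classes/dict_tech_{prefix}{n_tech}.pickle",
        "graph": f"{base}/networks/{prefix}comp_{n_companies}_tech_{n_tech}.gpickle",
    }


def save_pickle(obj, path):
    """Pickle obj to path, creating missing folders first."""
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        pickle.dump(obj, f)


def load_pickle(path):
    """Load a pickled object back from path."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Round trip in a temporary folder, with a toy dictionary for illustration
with tempfile.TemporaryDirectory() as tmp:
    paths = build_paths(2, 5, cybersecurity=True, base=f"{tmp}/savings")
    toy_dict = {"ALBERT": ["Analytics", "Database"]}
    save_pickle(toy_dict, paths["companies"])
    restored = load_pickle(paths["companies"])

print(restored == toy_dict)
```

The same `load_pickle` call would also read back a graph saved with `pickle.dump`, which is the natural counterpart now that the graph is no longer written with `nx.write_gpickle`.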

In [None]:
limits = [2443]
flag_cybersecurity = True

In [None]:
# nx.write_gpickle is no longer available, so the graph is saved with pickle instead.

for i in limits:
    df_limited = df[:i]  # set limits
    [dict_companies, dict_tech, B] = extract_classes_company_tech(df_limited)
    print(f"We have {len(dict_companies)} companies and {len(dict_tech)} technologies")
    
    # Save dictionaries in pickle files
    if not flag_cybersecurity:  # all fields
        name_file_com = "savings/classes/dict_companies_" + str(len(dict_companies)) + ".pickle"
        name_file_tech = "savings/classes/dict_tech_" + str(len(dict_tech)) + ".pickle"
    else:  # only companies in cybersecurity
        name_file_com = "savings/classes/dict_companies_cybersecurity_" + str(len(dict_companies)) + ".pickle"
        name_file_tech = "savings/classes/dict_tech_cybersecurity_" + str(len(dict_tech)) + ".pickle"

    # Save companies dictionary
    with open(name_file_com, "wb") as f:
        pickle.dump(dict_companies, f)

    # Save technologies dictionary
    with open(name_file_tech, "wb") as f:
        pickle.dump(dict_tech, f)
        
    # Save the bipartite graph
    if not flag_cybersecurity:  # all fields
        name_file_graph = 'savings/networks/comp_' + str(len(dict_companies)) + '_tech_' + str(len(dict_tech)) + '.gpickle'
    else:  # only companies in cybersecurity
        name_file_graph = 'savings/networks/cybersecurity_comp_' + str(len(dict_companies)) + '_tech_' + str(len(dict_tech)) + '.gpickle'

    # Save the graph using pickle
    with open(name_file_graph, "wb") as f:
        pickle.dump(B, f)

    print(f"Graph saved to {name_file_graph}")

We have 0 companies and 0 technologies
Graph saved to savings/networks/cybersecurity_comp_0_tech_0.gpickle
