## Create classes

This is the first notebook to run when starting TechRank. Here we load the data and create the classes for companies and technologies.

We then save the results as two dictionaries (`company_name: class_company` and `tech_name: class_tech`), which contain all the information needed for the next steps.

### Table of contents:

* [Download data from CSV](#down)
* [Data cleaning](#cleaning)
* [Select companies in cybersecurity](#cyber)
* [Create graph and dictionaries](#create_graph)
* [Save graph and dictionaries](#save)
* [Quick loop](#loop0)

In [53]:
flag_cybersecurity = True

In [54]:
import math
import arrow
import ipynb 
import os.path

import json
import pickle
import sys
import random
import operator

import pandas as pd
import networkx as nx
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as ss
import numpy as np

from dotenv import load_dotenv
from networkx.algorithms.bipartite.matrix import biadjacency_matrix
from networkx.algorithms import bipartite
from importlib import reload
from typing import List


In [55]:
# import functions from py file 

import functions.fun
reload(functions.fun)
from functions.fun import CB_data_cleaning, df_from_api_CB, extract_nodes, extract_data_from_column, field_extraction
from functions.fun import nx_dip_graph_from_pandas, plot_bipartite_graph, filter_dict, check_desc
from functions.fun import extract_classes_company_tech, degree_bip, insert_data_classes

In [56]:
# import functions from py file 

import functions.fun_meth_reflections
reload(functions.fun_meth_reflections)
from functions.fun_meth_reflections import zero_order_score, Gct_beta, Gtc_alpha, make_G_hat, next_order_score, generator_order_w
from functions.fun_meth_reflections import M_test_triangular, w_stream, find_convergence, rank_df_class, w_star_analytic

In [57]:
# import classes 

import classes
reload(classes)

<module 'classes' from 'c:\\Users\\tjga9\\Documents\\Tomas\\EPFL\\MA3\\CYD PDS\\Code\\TechRank\\5-TechRank-main\\5-TechRank-main\\classes.py'>

## Download data from CSV <a class="anchor" id="down"></a>

In [22]:
df_start = pd.read_csv(r"C:\Users\tjga9\Documents\Tomas\EPFL\MA3\CYD PDS\Code\savings\csv\entities_exploration\organizations.csv")

df_start.head()
# Check the row count immediately after loading the data
print(f"Right after loading: {len(df_start)} rows")

Right after loading: 3975432 rows


In [23]:
df_start.columns

Index(['uuid', 'name', 'type', 'permalink', 'cb_url', 'rank', 'created_at',
       'updated_at', 'legal_name', 'roles', 'domain', 'homepage_url',
       'country_code', 'state_code', 'region', 'city', 'address',
       'postal_code', 'status', 'short_description', 'category_list',
       'category_groups_list', 'num_funding_rounds', 'total_funding_usd',
       'total_funding', 'total_funding_currency_code', 'founded_on',
       'last_funding_on', 'closed_on', 'employee_count', 'email', 'phone',
       'facebook_url', 'linkedin_url', 'twitter_url', 'logo_url', 'alias1',
       'alias2', 'alias3', 'primary_role', 'num_exits'],
      dtype='object')

## Data Cleaning <a class="anchor" id="cleaning"></a>

We use the company name as the key. In the future, it would be better to use the `uuid`, since names are not guaranteed to be unique.

- `df_start`: dataset before cleaning
- `df`: dataset after cleaning
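
To illustrate why the `uuid` would be a safer key than the name, here is a minimal sketch on toy data. The rows below are hypothetical and are not taken from the Crunchbase export; the point is only that two organizations sharing a name collapse into one dictionary entry. This may be one reason why a name-keyed dictionary ends up with fewer entries than the DataFrame has rows, although other causes are possible.

```python
import pandas as pd

# Toy rows (hypothetical data): two distinct organizations share one name.
toy = pd.DataFrame({
    "uuid": ["u-001", "u-002", "u-003"],
    "name": ["Aura", "Aura", "Snyk"],
    "rank": [401.0, 52000.0, 1179.0],
})

# Keying by name: the second "Aura" silently overwrites the first.
by_name = {row["name"]: row["rank"] for _, row in toy.iterrows()}

# Keying by uuid: every row keeps its own entry.
by_uuid = {row["uuid"]: row["rank"] for _, row in toy.iterrows()}

# Number of rows whose name was already seen in an earlier row.
n_dupes = int(toy["name"].duplicated().sum())

print(len(by_name), len(by_uuid), n_dupes)
```

Running `df["name"].duplicated().sum()` on the cleaned dataset is a quick way to see how many rows would be affected.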

In [24]:
# we create the lists needed as input in the function to clean the data

to_drop = [
    'type',
    'permalink',
    'cb_url',   
    'created_at',
    'domain',
    'address',
    'state_code',
    'updated_at',
    'legal_name',
    'roles',
    'postal_code',
    'homepage_url',
    'num_funding_rounds',
    'total_funding_currency_code',
    'phone',
    'email',
    'num_exits',
    'alias2',
    'alias3',
    'logo_url',
    'alias1',
    'last_funding_on',
    'twitter_url',
    'facebook_url'
]

# to_rename = { 'category_groups_list': 'category_groups' }
to_rename = { 'category_list': 'category_groups' }

drop_if_nan = [
    'category_groups',
    'rank',
    'short_description'
]

to_check_double = {}

sort_by = "rank"

In [25]:
# clean data: from df_start to df
df = CB_data_cleaning(df_start, to_drop, to_rename, to_check_double, drop_if_nan, sort_by)
print(f"After cleaning: {len(df)} rows")


After cleaning: 3752729 rows


In [26]:
# show cleaned dataset
# .head() shows the first 5 rows of the dataframe
df.head()

Unnamed: 0,uuid,name,rank,country_code,region,city,status,short_description,category_groups,category_groups_list,total_funding_usd,total_funding,founded_on,closed_on,employee_count,linkedin_url,primary_role
359872,cf2c678c-b81a-80c3-10d1-9c5e76448e51,OpenAI,1.0,USA,California,San Francisco,operating,OpenAI creates artificial intelligence technol...,"Artificial Intelligence (AI),Generative AI,Mac...","Artificial Intelligence (AI),Data and Analytic...",61900120000.0,61900120000.0,2015-12-11,,1001-5000,https://www.linkedin.com/company/openai,company
1367042,e10aaff2-4d89-46d4-820b-b4f64b8d42ca,Anthropic,2.0,USA,California,San Francisco,operating,Anthropic is an AI research company that focus...,"Artificial Intelligence (AI),Generative AI,Inf...","Artificial Intelligence (AI),Data and Analytic...",17243880000.0,17243880000.0,2021-01-01,,1001-5000,https://www.linkedin.com/company/anthropicrese...,investor
1004841,d65024d7-58fb-4788-ad27-6218f7d93f55,CoreWeave,3.0,USA,New Jersey,Livingston,ipo,CoreWeave is a cloud-based AI infrastructure c...,"Artificial Intelligence (AI),Cloud Computing,C...","Artificial Intelligence (AI),Data and Analytic...",15429700000.0,15429700000.0,2017-01-01,,501-1000,https://www.linkedin.com/company/coreweave,investor
2926850,fb1e8b91-ca1f-46fa-b773-b147af49088c,xAI,4.0,USA,California,Burlingame,operating,XAI is an artificial intelligence startup that...,"Artificial Intelligence (AI),Generative AI,Inf...","Artificial Intelligence (AI),Data and Analytic...",22412370000.0,22412370000.0,2023-07-12,,11-50,https://www.linkedin.com/company/xai,company
1398961,feb8d007-c58b-45f8-bf42-4a31256fc0d4,Glean,5.0,USA,California,Palo Alto,operating,Glean develops an AI-based search engine softw...,"Artificial Intelligence (AI),Enterprise Softwa...","Artificial Intelligence (AI),Data and Analytic...",768200000.0,768200000.0,2019-01-01,,501-1000,https://www.linkedin.com/company/gleanwork,company


In [27]:
df.columns
print(f"After cleaning: {len(df)} rows")

After cleaning: 3752729 rows


In [28]:
# Check BEFORE the conversion
print("BEFORE the conversion:")
print(f"Type: {type(df['category_groups'].iloc[0])}")
print(f"Example: {df['category_groups'].iloc[0]}")

# Convert a comma-separated string into a list of categories
def convert_to_list(string):
    return string.split(",")

if type(df["category_groups"].iloc[0]) != list:
    df["category_groups"] = df["category_groups"].apply(convert_to_list)
    print("\n✓ Conversion done")

# Check AFTER the conversion
print("\nAFTER the conversion:")
print(f"Type: {type(df['category_groups'].iloc[0])}")
print(f"Example: {df['category_groups'].iloc[0]}")

# Check that every value is a list
print(f"\nAll values are lists: {df['category_groups'].apply(lambda x: isinstance(x, list)).all()}")

# Check that the row count is unchanged
print(f"Number of rows after conversion: {len(df)}")

BEFORE the conversion:
Type: <class 'str'>
Example: Artificial Intelligence (AI),Generative AI,Machine Learning,Natural Language Processing,SaaS

✓ Conversion done

AFTER the conversion:
Type: <class 'list'>
Example: ['Artificial Intelligence (AI)', 'Generative AI', 'Machine Learning', 'Natural Language Processing', 'SaaS']

All values are lists: True
Number of rows after conversion: 3752729


In [29]:
# convert category_groups to list (no-op if the previous cell already converted the column)

def convert_to_list(string):
    li = list(string.split(","))
    return li
  
if type(df["category_groups"][df.index[0]]) != list:
    df["category_groups"] = [convert_to_list(x) for x in df["category_groups"]]

### Select companies in cybersecurity <a class="anchor" id="cyber"></a>


We select only companies that work in the cybersecurity field. The algorithm extends easily to any other field: we only need to change the _field_words_ list.

Note that if we also want to work on a sub-sample, we have to cut the dataset at this stage (as is done in the quick loop at the end of this notebook).
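
The idea of switching fields by changing a keyword list can be sketched as follows. This is not the implementation of `field_extraction` (which lives in `functions/fun.py` and is not shown here); `select_field` and the toy rows are hypothetical names and data used only for illustration, and the matching rule (case-insensitive substring match on the category list) is an assumption.

```python
import pandas as pd

def select_field(df, field_words, column="category_groups"):
    """Keep rows whose category list mentions any keyword (case-insensitive)."""
    words = [w.lower() for w in field_words]

    def matches(categories):
        # Join the categories so multi-word labels are matched as substrings.
        text = " ".join(categories).lower()
        return any(w in text for w in words)

    return df[df[column].apply(matches)]

# Toy data (hypothetical), already converted to lists of categories.
toy = pd.DataFrame({
    "name": ["A", "B", "C"],
    "category_groups": [
        ["Cyber Security", "SaaS"],
        ["Banking", "FinTech"],
        ["Network Security"],
    ],
})

cyber = select_field(toy, ["cyber", "security"])   # cybersecurity companies
fintech = select_field(toy, ["fintech"])           # same code, another field
subsample = cyber[:1]                              # cut the dataset at this stage
```

Only the keyword list changes between the two calls, and the sub-sample is taken after the field filter, as described above.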


In [30]:
flag_cybersecurity

True

In [31]:
# inspect types / examples
print(df['category_groups'].apply(lambda x: type(x)).value_counts())
print(df['category_groups'].head(20).tolist())
# count empty / NaN values
print("na:", df['category_groups'].isna().sum())

category_groups
<class 'list'>    3752729
Name: count, dtype: int64
[['Artificial Intelligence (AI)', 'Generative AI', 'Machine Learning', 'Natural Language Processing', 'SaaS'], ['Artificial Intelligence (AI)', 'Generative AI', 'Information Technology', 'Machine Learning'], ['Artificial Intelligence (AI)', 'Cloud Computing', 'Cloud Infrastructure', 'Information Technology', 'Machine Learning'], ['Artificial Intelligence (AI)', 'Generative AI', 'Information Technology', 'Machine Learning'], ['Artificial Intelligence (AI)', 'Enterprise Software', 'Generative AI', 'Machine Learning', 'Search Engine'], ['Banking', 'Credit', 'Finance', 'Financial Services'], ['Artificial Intelligence (AI)', 'Chatbot', 'Generative AI', 'Machine Learning', 'Natural Language Processing', 'Search Engine'], ['Banking', 'Communities', 'Finance', 'Financial Services'], ['Consumer', 'Consumer Electronics', 'Consumer Software', 'Information Technology'], ['Bitcoin', 'Blockchain', 'Cryptocurrency', 'FinTech', 'Tradi

In [32]:

print("before: ", df.shape)
df, flag_cybersecurity = field_extraction('cybersecurity', df)
print("after:  ", df.shape, " flag:", flag_cybersecurity)
print(df.head(10))


before:  (3752729, 17)
after:   (8994, 17)  flag: True
                                         uuid        name    rank  \
1480808  2b74e337-45a9-4bcc-a798-2933b3989173       Voxel    74.0   
1611967  b1a43f10-5a94-4bd1-b8db-4662e3f6006d       World   116.0   
888599   76816673-8f24-4212-b1aa-f2a6fdd47935        Aura   401.0   
74714    3e9d5f7e-7301-b645-66af-eb756892af3a     Zscaler   821.0   
412966   27009f53-5f8d-b64f-20fb-e8e31fa5fdff  Silverfort   824.0   
994231   4d70e0a8-951e-4aac-bc45-c382ca5d6f31        Veza   873.0   
154230   41544b58-79b0-d26a-1e2f-5b54656aa1b5    Telegram  1011.0   
2638343  5b5efaec-1bc5-4237-9466-cbbbe8576f84        Qodo  1105.0   
3599854  ab1d75a4-1883-4d09-b1e8-927834edbf39       Edera  1144.0   
304343   eafb244d-aac2-7203-dd8f-44d777cce8da        Snyk  1179.0   

        country_code         region           city     status  \
1480808          USA     California  San Francisco  operating   
1611967          DEU         Berlin         Berlin  op

In [33]:
keywords = ['cyber', 'security', 'cybersecurity']
mask = df['category_groups'].apply(
    lambda lst: isinstance(lst, list) and any(k.lower() in ' '.join(lst).lower() for k in keywords)
)
print("matched in category_groups:", mask.sum())
print(df[mask].head(10))
# also check the descriptions
desc_mask = df['short_description'].astype(str).str.contains('|'.join(keywords), case=False, na=False)
print("matched in short_description:", desc_mask.sum())
print(df[desc_mask].head(10))

matched in category_groups: 3760
                                         uuid           name    rank  \
888599   76816673-8f24-4212-b1aa-f2a6fdd47935           Aura   401.0   
74714    3e9d5f7e-7301-b645-66af-eb756892af3a        Zscaler   821.0   
412966   27009f53-5f8d-b64f-20fb-e8e31fa5fdff     Silverfort   824.0   
994231   4d70e0a8-951e-4aac-bc45-c382ca5d6f31           Veza   873.0   
3599854  ab1d75a4-1883-4d09-b1e8-927834edbf39          Edera  1144.0   
304343   eafb244d-aac2-7203-dd8f-44d777cce8da           Snyk  1179.0   
63469    0ccbefd4-1a6c-3c53-8302-e0673d15f129        Trulioo  1367.0   
3659928  d9c814e2-fff0-44ce-8846-66df480a94e0           Noma  1494.0   
3769738  9062647d-9dc5-41de-826c-add5a0a6aa72          Astor  1589.0   
2865940  b7f3c910-a90d-43f8-a069-76ef422c2807  Look Up Space  1943.0   

        country_code            region       city     status  \
888599           USA     Massachusetts     Boston  operating   
74714            USA        California   San J

In [34]:
df.head()

Unnamed: 0,uuid,name,rank,country_code,region,city,status,short_description,category_groups,category_groups_list,total_funding_usd,total_funding,founded_on,closed_on,employee_count,linkedin_url,primary_role
1480808,2b74e337-45a9-4bcc-a798-2933b3989173,Voxel,74.0,USA,California,San Francisco,operating,Voxel enhances workplace safety and operationa...,"[Artificial Intelligence (AI), Computer Vision...","Artificial Intelligence (AI),Data and Analytic...",74000000.0,74000000.0,2020-10-01,,101-250,https://www.linkedin.com/company/voxelai,company
1611967,b1a43f10-5a94-4bd1-b8db-4662e3f6006d,World,116.0,DEU,Berlin,Berlin,operating,World connects users through a privacy-focused...,"[Blockchain, Cryptocurrency, Finance, FinTech,...","Blockchain and Cryptocurrency,Financial Servic...",379000000.0,379000000.0,2019-01-01,,101-250,https://www.linkedin.com/company/worldofficial,company
888599,76816673-8f24-4212-b1aa-f2a6fdd47935,Aura,401.0,USA,Massachusetts,Boston,operating,Aura provides AI-powered security solutions th...,"[Cyber Security, Information Technology, Netwo...","Information Technology,Privacy and Security",662650000.0,662650000.0,2017-01-01,,501-1000,https://www.linkedin.com/company/auracompany,company
74714,3e9d5f7e-7301-b645-66af-eb756892af3a,Zscaler,821.0,USA,California,San Jose,ipo,Zscaler is a global cloud-based information se...,"[Cloud Security, Cyber Security, Enterprise So...","Information Technology,Privacy and Security,So...",1670695000.0,1670695000.0,2008-01-01,,5001-10000,https://www.linkedin.com/company/zscaler,company
412966,27009f53-5f8d-b64f-20fb-e8e31fa5fdff,Silverfort,824.0,ISR,Tel Aviv,Tel Aviv,operating,Silverfort is the identity security platform t...,"[Cyber Security, Enterprise Software, Identity...","Information Technology,Privacy and Security,So...",222500000.0,222500000.0,2016-01-01,,251-500,https://www.linkedin.com/company/silverfort,company


In [35]:
df['short_description'].values

array(['Voxel enhances workplace safety and operational efficiency by transforming existing security cameras into intelligent monitoring systems.',
       'World connects users through a privacy-focused network with secure digital asset management.',
       'Aura provides AI-powered security solutions that offer identity theft protection, credit monitoring, and online privacy for individuals.',
       ...,
       'Medus Llc. creates imaging software for medical, veterinary, security, and testing industries with secure remote data access.',
       "4600Boehm offers payment integrity and healthcare recovery services, specializing in workers' compensation claims and data analysis.",
       'COSA provides intelligence solutions to global businesses that integrate security, compliance, integrity, and continuity into operations.'],
      dtype=object)

### Create Companies and Technologies classes

#### Ranking

I personally appreciate the ranking that you provide for each company. However, I did not quite understand what's the magic behind it. Is there any chance to get some more insight/details, also considering that we do have an NDA in place?

- Crunchbase rank uses Crunchbase’s intelligent algorithms to score and rank entities (e.g. Company, People, Investors, etc.).
- The algorithms take into account many different variables, ranging from funding events and the strength of the entity’s relationships with other entities in the Crunchbase ecosystem to the level of engagement on our website, news articles, and acquisitions.

    - A company’s Rank is fluid and subject to rising and decaying over time with time-sensitive events. Events such as product launches, funding events, leadership changes, and news affect a company’s Crunchbase Rank.


- The Crunchbase rank shows where an entity falls in the Crunchbase database relative to all other entities in that entity type (i.e. if searching for companies, you will see where a specific company ranks relative to all other companies). An entity with a Crunchbase Rank of 1 has the highest rank relative to all other entities of that type.

I would also suggest leveraging our Trend Score - 7 Day, 30 Day, 90 Day (e.g. Company, People, Investors, etc.)

- While Rank shows context, Crunchbase Trend Score demonstrates activity. A company’s rank will change based on activity (fundraising, news, etc.) and Trend Score is an indicator of how much their rank is changing at any given time.
- Crunchbase Trend Score tracks the fluctuations in Rank. As a company’s rank changes, so does its Trend Score.
- Trend Score measures the rate of a company’s activity on a 20-point (+10 <-> -10) scale. Scores closer to +10 mean the company is moving up in rank much faster than its peers. Scores closer to -10 mean it is moving down.
- For example, a company that announces its first funding round will likely experience a jump in its Rank, pushing its Trend Score up as its page views, article counts, funding amount, team members, etc., begin to increase.


## Create graph and dictionaries <a class="anchor" id="create_graph"></a>

In [58]:
# Extract the dictionaries of companies and technologies from the dataset and create the network
df_limited = df  # by default, use the whole DataFrame
[dict_companies, dict_tech, B] = extract_classes_company_tech(df_limited)

In [59]:
print(f"We have {len(dict_companies)} companies and {len(dict_tech)} technologies")

We have 8836 companies and 658 technologies
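
For readers who want a picture of the structure being built, here is a minimal sketch of a company-technology bipartite graph in NetworkX. It is not the code of `extract_classes_company_tech`, whose internals are not shown in this notebook; the toy company names and the choice of the `bipartite` node attribute are assumptions made for illustration.

```python
import networkx as nx
from networkx.algorithms import bipartite

# Toy data (hypothetical): company name -> list of categories (technologies).
toy = {
    "AlphaSec": ["Cyber Security", "Network Security"],
    "BetaCloud": ["Cloud Security", "Cyber Security"],
    "GammaID": ["Identity Management"],
}

B_toy = nx.Graph()
B_toy.add_nodes_from(toy.keys(), bipartite=0)            # company side
techs = {t for cats in toy.values() for t in cats}
B_toy.add_nodes_from(techs, bipartite=1)                 # technology side
B_toy.add_edges_from((c, t) for c, cats in toy.items() for t in cats)

companies = sorted(n for n, d in B_toy.nodes(data=True) if d["bipartite"] == 0)
technologies = sorted(n for n, d in B_toy.nodes(data=True) if d["bipartite"] == 1)

# Degree of each technology = number of companies linked to it.
tech_degree = dict(B_toy.degree(technologies))

print(f"We have {len(companies)} companies and {len(technologies)} technologies")
```

One caveat of this layout: a company and a technology with the same label would be merged into a single node, so real data may need distinct prefixes or identifiers on each side.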


## Save dictionaries and network <a class="anchor" id="save"></a>

In [60]:
# Save dictionaries in pickle files

if not flag_cybersecurity:  # all fields
    name_file_com = "savings/classes/dict_companies_" + str(len(dict_companies)) + ".pickle"
    name_file_tech = "savings/classes/dict_tech_" + str(len(dict_tech)) + ".pickle"
else:  # only companies in cybersecurity
    name_file_com = "savings/classes/dict_companies_cybersecurity_" + str(len(dict_companies)) + ".pickle"
    name_file_tech = "savings/classes/dict_tech_cybersecurity_" + str(len(dict_tech)) + ".pickle"

# companies
with open(name_file_com, "wb") as f:
    pickle.dump(dict_companies, f)

#technologies
with open(name_file_tech, "wb") as f:
    pickle.dump(dict_tech, f)

In [61]:
# Save the bipartite graph with pickle
# (nx.write_gpickle was removed in NetworkX 3.0, so the graph is pickled directly)
if not flag_cybersecurity:  # all fields
    name_file_graph = 'savings/networks/comp_' + str(len(dict_companies)) + '_tech_' + str(len(dict_tech)) + '.gpickle'
else:  # only companies in cybersecurity
    name_file_graph = 'savings/networks/cybersecurity_comp_' + str(len(dict_companies)) + '_tech_' + str(len(dict_tech)) + '.gpickle'

with open(name_file_graph, "wb") as f:
    pickle.dump(B, f)

print(f"Graph saved to {name_file_graph}")

Graph saved to savings/networks/cybersecurity_comp_8836_tech_658.gpickle
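
Since the saved files are plain pickles, later notebooks can read them back with `pickle.load`. The sketch below shows a save/load round trip; `save_pickle` and `load_pickle` are hypothetical helper names, the payload is a stand-in dictionary rather than the real graph, and creating the parent directory first is an addition not present in the cells above.

```python
import pickle
import tempfile
from pathlib import Path

def save_pickle(obj, path):
    """Pickle obj to path, creating parent directories if they are missing."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "wb") as f:
        pickle.dump(obj, f)

def load_pickle(path):
    """Load a pickled object; only use this on files you created yourself."""
    with open(path, "rb") as f:
        return pickle.load(f)

# Round trip in a temporary directory, with a stand-in object instead of the graph.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "savings" / "networks" / "toy.gpickle"
    payload = {"nodes": ["A", "Cyber Security"], "edges": [("A", "Cyber Security")]}
    save_pickle(payload, target)
    restored = load_pickle(target)
```

The same two helpers would work for the company and technology dictionaries and for the graph object, since all of them are written with `pickle.dump`.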


## Quick loop  <a class="anchor" id="loop0"></a>

By quick loop, we mean that we repeat all the steps of the previous sections in a single loop, updating the dictionaries for every sample size in `limits`.

In this part, you won't see many comments because everything has already been explained above :)
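
The file-naming logic is the same in every section, so it could be factored into one helper. The sketch below is one possible way to do that; `output_paths` is a hypothetical function that does not exist in the repository, and it simply reproduces the path patterns used in the cells of this notebook.

```python
def output_paths(n_companies, n_tech, cybersecurity, root="savings"):
    """Build the three output paths used in this notebook from the dictionary sizes."""
    prefix = "cybersecurity_" if cybersecurity else ""
    return {
        "companies": f"{root}/classes/dict_companies_{prefix}{n_companies}.pickle",
        "tech": f"{root}/classes/dict_tech_{prefix}{n_tech}.pickle",
        "graph": f"{root}/networks/{prefix}comp_{n_companies}_tech_{n_tech}.gpickle",
    }

paths = output_paths(2432, 522, cybersecurity=True)
all_fields = output_paths(10, 5, cybersecurity=False)
```

Inside the loop, `output_paths(len(dict_companies), len(dict_tech), flag_cybersecurity)` would then replace the repeated `if`/`else` blocks.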

In [62]:
limits = [2443]
flag_cybersecurity = True

In [63]:
# nx.write_gpickle is no longer available (removed in NetworkX 3.0),
# so the graph is saved with pickle instead.

for i in limits:
    df_limited = df[:i]  # set limits
    [dict_companies, dict_tech, B] = extract_classes_company_tech(df_limited)
    print(f"We have {len(dict_companies)} companies and {len(dict_tech)} technologies")
    
    # Save dictionaries in pickle files
    if flag_cybersecurity == False:  # all fields
        name_file_com = "savings/classes/dict_companies_" + str(len(dict_companies)) + ".pickle"
        name_file_tech = "savings/classes/dict_tech_" + str(len(dict_tech)) + ".pickle"
    else:  # only companies in cybersecurity
        name_file_com = "savings/classes/dict_companies_cybersecurity_" + str(len(dict_companies)) + ".pickle"
        name_file_tech = "savings/classes/dict_tech_cybersecurity_" + str(len(dict_tech)) + ".pickle"

    # Save companies dictionary
    with open(name_file_com, "wb") as f:
        pickle.dump(dict_companies, f)

    # Save technologies dictionary
    with open(name_file_tech, "wb") as f:
        pickle.dump(dict_tech, f)
        
    # Save the bipartite graph
    if flag_cybersecurity == False:  # all fields
        name_file_graph = 'savings/networks/comp_' + str(len(dict_companies)) + '_tech_' + str(len(dict_tech)) + '.gpickle'
    else:  # only companies in cybersecurity
        name_file_graph = 'savings/networks/cybersecurity_comp_' + str(len(dict_companies)) + '_tech_' + str(len(dict_tech)) + '.gpickle'

    # Save the graph using pickle
    with open(name_file_graph, "wb") as f:
        pickle.dump(B, f)

    print(f"Graph saved to {name_file_graph}")

We have 2432 companies and 522 technologies
Graph saved to savings/networks/cybersecurity_comp_2432_tech_522.gpickle
