# Learning from networks - Stonks

Start by importing the libraries we will use throughout the notebook.

In [3]:
import networkx as nx
import extended_networkx as ex

## Load graph and compute market capitalization

First we load the graph file. We then compute what we call the "market capitalization" of a node: the sum of the weights of its incoming edges, i.e. how much that node has been bought.

In [28]:
# Load Graph
G = nx.read_gml("out_graph.gml")
G_undirected = G.to_undirected() # Needed to compute the number of connected components

print(f"The graph contains {len(G.nodes())} nodes and {len(G.edges())} edges.")
print(f"There are {nx.number_connected_components(G_undirected)} connected components.")
print(f"Sizes: {[len(c) for c in sorted(nx.connected_components(G_undirected), key=len, reverse=True)]}")

#print(f"The diameter of the graph is {nx.diameter(G)}.")


def compute_capitalization(G: nx.DiGraph):
    """
    Adds the 'capitalization' attribute to every node: the sum of the weights of its incoming edges.
    """
    for node in G.nodes():
        capitalization = 0
        for edge in G.in_edges(node):
            capitalization += G.get_edge_data(*edge)["weight"]
        G.nodes[node]["capitalization"] = capitalization

compute_capitalization(G)

The graph contains 25396 nodes and 396114 edges.
There are 6 connected components.
Sizes: [25189, 125, 38, 18, 13, 13]
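
The capitalization defined above is what NetworkX calls the weighted in-degree. As a minimal sketch on a made-up toy graph (the node names and weights below are illustrative and are not taken from `out_graph.gml`), the same attribute can be obtained from `in_degree(weight="weight")`:

```python
import networkx as nx

# Toy directed graph; node names and weights are illustrative only.
toy = nx.DiGraph()
toy.add_weighted_edges_from([
    ("fund_a", "AAPL", 3.0),
    ("fund_b", "AAPL", 2.0),
    ("fund_a", "MSFT", 1.5),
])

# Weighted in-degree = sum of the weights of the incoming edges of each node.
toy_capitalization = dict(toy.in_degree(weight="weight"))
nx.set_node_attributes(toy, toy_capitalization, "capitalization")

print(toy.nodes["AAPL"]["capitalization"])
```

Nodes without incoming edges (here `fund_a` and `fund_b`) get a capitalization of 0.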


Now let's print the 20 nodes with the highest capitalization.

In [14]:
k = 20
print(f"Top {k} nodes with highest capitalization: {ex.max_k_nodes(G, k, 'capitalization')}")

Top 20 nodes with highest capitalization: ['CPIN', 'AAPL', 'MSFT', 'AMZN', 'ADRO', 'GOOGL', 'FB', 'GOOG', 'TSLA', 'NVDA', 'JPM', 'UNVR', 'JNJ', 'V', 'UNH', 'HD', 'PG', 'BAC', 'MA', 'PYPL']
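
The implementation of `ex.max_k_nodes` is not shown in this notebook. A minimal sketch of what such a helper could look like, assuming it simply ranks nodes by one attribute (the name `top_k_nodes` and the toy values are hypothetical):

```python
import heapq
import networkx as nx

def top_k_nodes(G, k, attribute):
    """Hypothetical stand-in for ex.max_k_nodes: names of the k nodes with the largest `attribute`."""
    # G.nodes(data=attribute, default=0) yields (node, value) pairs;
    # nodes that lack the attribute are treated as 0.
    pairs = G.nodes(data=attribute, default=0)
    return [node for node, _ in heapq.nlargest(k, pairs, key=lambda pair: pair[1])]

# Toy graph with made-up capitalization values.
toy = nx.DiGraph()
toy.add_node("A", capitalization=10)
toy.add_node("B", capitalization=25)
toy.add_node("C", capitalization=3)

print(top_k_nodes(toy, 2, "capitalization"))
```

`heapq.nlargest` avoids sorting all nodes when `k` is much smaller than the graph.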


## Node-level features

### Betweenness centrality

Next we estimate the betweenness centrality of the nodes. The graph is too big to compute the exact betweenness centrality of every node, so we only use a small percentage of the nodes (2% in the cell below).

In [6]:
b_centralities = ex.betweenness_centrality_percent(G, percentage=0.02)
print(sorted(b_centralities.items(), key=lambda t: t[1], reverse=True)[:k])

[('MCRO', 0.0009539195687510155), ('VWO', 0.0005239916465104104), ('FRDM', 0.000415247456603121), ('CPI', 0.00039124606040215494), ('VGK', 0.00035217582645689314), ('GEM', 0.0003461948960119922), ('IPAC', 0.0003433986168429476), ('SMCP', 0.00030976559239305024), ('HNDL', 0.00027388000972364474), ('IEUR', 0.00025360698574807146), ('SCZ', 0.00022440140331582802), ('ITOT', 0.00020801210040837225), ('CRBN', 0.00011767674836395966), ('RALS', 8.404372391406228e-05), ('EEM', 7.029535133292639e-05), ('UPRO', 6.81204675347806e-05), ('IEMG', 6.384837435985138e-05), ('FAB', 6.221721151124204e-05), ('SMIN', 6.004232771309625e-05), ('HART', 5.3828374004108276e-05)]
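
How `ex.betweenness_centrality_percent` samples nodes is not shown here. NetworkX itself offers a sampled estimate through the `k` parameter of `nx.betweenness_centrality`, which uses `k` randomly chosen source nodes. The sketch below assumes the percentage is used that way and runs on a small random graph rather than the real data:

```python
import networkx as nx

# Small random directed graph standing in for the real one (illustrative only).
demo = nx.gnp_random_graph(200, 0.05, seed=42, directed=True)

# Use 2% of the nodes as shortest-path sources (at least one).
n_pivots = max(1, int(0.02 * demo.number_of_nodes()))
approx_bc = nx.betweenness_centrality(demo, k=n_pivots, seed=42)

top5 = sorted(approx_bc.items(), key=lambda t: t[1], reverse=True)[:5]
print(top5)
```

With so few source nodes the estimate can be noisy, so the ranking is best read as indicative rather than exact.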


### Clustering coefficients

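Computing the exact clustering coefficient, on the other hand, is doable on the full graph. We compute the weighted local coefficient of every node and the global coefficient (transitivity).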

In [7]:
nodes_clustering_coeff = nx.clustering(G, weight="weight")
# Print top k nodes with highest clustering coefficient
print(f"Top {k} nodes by clustering coefficient") 
print(sorted(nodes_clustering_coeff.items(), key=lambda t: t[1], reverse=True)[:k])
print(f"Global clustering coefficient: {nx.transitivity(G)}")


Top 20 nodes by clustering coefficient
[('RWVG', 2.220278602733271e-05), ('UNA', 1.9880340687835788e-05), ('SSI', 1.5219529225139582e-05), ('HPG', 1.2501223578956097e-05), ('PDR', 1.2427193087025836e-05), ('CRHl', 1.2137344349100606e-05), ('VIC', 1.1967758246053812e-05), ('KRZ', 1.1909451311225256e-05), ('BCM', 1.1672895376409549e-05), ('RWGV', 1.1043518322844037e-05), ('VHM', 1.0180368464746611e-05), ('VCI', 8.455830606685144e-06), ('NVL', 8.304115970363736e-06), ('HSG', 8.276992510958094e-06), ('GEX', 7.798399073501852e-06), ('EXK', 7.639429907146357e-06), ('VCB', 7.093919710666592e-06), ('NOVO', 6.784985924541269e-06), ('JETl', 6.676593093953429e-06), ('DGC', 6.654628072496844e-06)]
Global clustering coefficient: 9.24110477412097e-05
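
To make the three quantities concrete, here is a small sketch on a made-up undirected graph: a triangle plus one heavy pendant edge. In the weighted variant NetworkX divides every edge weight by the largest weight in the graph and takes the geometric mean over each triangle, so a few very heavy edges can push the weighted coefficients towards zero. That may be part of why the values above are so small, although we have not verified this on the real data.

```python
import networkx as nx

# Triangle a-b-c plus a heavy edge c-d (weights are illustrative only).
toy = nx.Graph()
toy.add_weighted_edges_from([
    ("a", "b", 1.0), ("b", "c", 1.0), ("a", "c", 1.0), ("c", "d", 100.0),
])

local = nx.clustering(toy)                      # unweighted local coefficients
weighted = nx.clustering(toy, weight="weight")  # weights are scaled by the maximum weight
global_cc = nx.transitivity(toy)                # 3 * triangles / connected triples

print(local)
print(weighted)
print(global_cc)
```

Node `a` sits in a closed triangle, so its unweighted coefficient is 1, while its weighted coefficient drops by a factor of 100 because of the single heavy edge elsewhere in the graph.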


### Closeness Centrality

To compute closeness centrality we use an algorithm found online, which computes all shortest paths with the Floyd-Warshall method and then uses the resulting shortest-path matrix to calculate the closeness of each node.
We run it on a sample of the graph: we devised our own algorithm, `connected_random_subgraph(G, n)`, to draw a connected sample of `n` nodes from the graph `G`.
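
The code of `ex.connected_random_subgraph` is not included in this notebook. As a rough sketch of one way to draw such a sample, assuming connectivity is judged while ignoring edge direction, we can grow a node set outwards from a random start node (the name `connected_sample` is hypothetical):

```python
import random
import networkx as nx

def connected_sample(G, n, seed=None):
    """Hypothetical sketch: grow a connected set of n nodes from a random start node."""
    rng = random.Random(seed)
    U = G.to_undirected(as_view=True)  # ignore edge direction for connectivity
    # Only components with at least n nodes can contain the sample.
    big = [c for c in nx.connected_components(U) if len(c) >= n]
    if not big:
        raise ValueError("no connected component has at least n nodes")
    component = rng.choice(big)
    start = rng.choice([v for v in U.nodes() if v in component])
    chosen = {start}
    frontier = list(U.neighbors(start))
    while len(chosen) < n:
        # Every frontier node is adjacent to a node that was already chosen.
        v = frontier.pop(rng.randrange(len(frontier)))
        if v in chosen:
            continue
        chosen.add(v)
        frontier.extend(w for w in U.neighbors(v) if w not in chosen)
    return G.subgraph(chosen).copy()

demo = nx.gnp_random_graph(300, 0.03, seed=1, directed=True)
sample = connected_sample(demo, 50, seed=0)
print(sample.number_of_nodes(), nx.is_weakly_connected(sample))
```

Growing from a frontier keeps the sample connected by construction, at the cost of favouring nodes close to the start node.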

In [10]:
sub_G = ex.connected_random_subgraph(G, 6000)
c_centralities = ex.closeness_centrality_matrix(sub_G)
print(sorted(c_centralities.items(), key=lambda t: t[1], reverse=True)[:k])

There are 1 components with more than 6000 nodes.
[(2033, 0.812130823621288), (1024, 0.3520103879802201), (31, 0.14168102151613418), (2176, 0.13927210843714138), (4358, 0.09076650341814653), (641, 0.01308046959666535), (2933, 0.010767305836054453), (2320, 0.006067117182906897), (1372, 0.002031822950405895), (3723, 0.0014133657038335524), (3104, 1.2274503253546062e-05), (1020, 4.4567811360105214e-06), (3051, 1.3495951867394154e-06), (3307, 1.335601104949575e-06), (4317, 1.2073640788589837e-06), (2931, 1.0708925772847575e-06), (4903, 1.0708925772847575e-06), (4258, 8.701400748737714e-07), (1870, 8.047489982025647e-07), (5851, 7.523140401305416e-07)]
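
The matrix-based approach can be sketched as follows. This is not the code behind `ex.closeness_centrality_matrix`, only an illustration built on `nx.floyd_warshall_numpy`, with the same normalisation that `nx.closeness_centrality` applies by default (incoming distances and Wasserman-Faust scaling). Floyd-Warshall needs cubic time and quadratic memory in the number of nodes, which is why it only makes sense on a sample.

```python
import numpy as np
import networkx as nx

def closeness_from_distance_matrix(G, weight=None):
    """Sketch: closeness centrality from the Floyd-Warshall all-pairs distance matrix."""
    nodes = list(G.nodes())
    n = len(nodes)
    # dist[i, j] is the shortest-path length from nodes[i] to nodes[j] (inf if unreachable).
    # weight=None counts hops, like the default of nx.closeness_centrality.
    dist = nx.floyd_warshall_numpy(G, nodelist=nodes, weight=weight)
    result = {}
    for j, node in enumerate(nodes):
        incoming = dist[:, j]  # distances from every node to `node`
        reachable = np.isfinite(incoming) & (incoming > 0)
        r = int(reachable.sum())
        if r == 0 or n <= 1:
            result[node] = 0.0
            continue
        total = float(incoming[reachable].sum())
        # (r / total) is the plain closeness; (r / (n - 1)) scales it by the reachable fraction.
        result[node] = (r / total) * (r / (n - 1))
    return result

path = nx.path_graph(4)
print(closeness_from_distance_matrix(path))
print(nx.closeness_centrality(path))
```

On this small path graph both calls should give the same values, which is a quick sanity check before moving to a larger sample.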


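For comparison, we compute the closeness centrality with `nx.closeness_centrality`. Note that the matrix-based result above is keyed by integer indices, whereas this one is keyed by ticker symbols.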

In [9]:
c_centralities = nx.closeness_centrality(sub_G)
print(sorted(c_centralities.items(), key=lambda t: t[1], reverse=True)[:k])

[('AAPL', 0.018545019557766967), ('INTC', 0.017781809788810953), ('GOOGL', 0.017350211049160874), ('NVDA', 0.0169144044332592), ('ADM', 0.016803842307051176), ('DTE', 0.016325438528502097), ('GILD', 0.0161368358140991), ('LH', 0.015928667238932347), ('ABBV', 0.01588171199264918), ('FB', 0.015542820355116657), ('VZ', 0.0154890012532953), ('CRM', 0.015352687980296457), ('CMCSA', 0.015313103742710117), ('PFE', 0.015256511005802555), ('IP', 0.015096855765269056), ('AVGO', 0.015003763253471539), ('IBM', 0.01491003708951492), ('PPL', 0.014672380037195742), ('V', 0.01461897784969174), ('COST', 0.014402400400066679)]
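
One simple way to quantify how much two rankings agree is the overlap of their top-`k` sets. The helper below is a hypothetical sketch; it assumes both dictionaries are keyed by the same node labels, so integer indices would first have to be mapped back to ticker symbols. The scores in the example are made up.

```python
def top_k_overlap(scores_a, scores_b, k):
    """Jaccard overlap between the top-k keys of two score dictionaries."""
    top_a = set(sorted(scores_a, key=scores_a.get, reverse=True)[:k])
    top_b = set(sorted(scores_b, key=scores_b.get, reverse=True)[:k])
    union = top_a | top_b
    return len(top_a & top_b) / len(union) if union else 0.0

# Made-up scores for three nodes.
a = {"x": 3.0, "y": 2.0, "z": 1.0}
b = {"x": 1.0, "y": 3.0, "z": 2.0}
print(top_k_overlap(a, b, 2))
```

A value of 1 means both methods select the same top-`k` nodes, and a value of 0 means they share none.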
