<div style="font-size:30px" align="center"> <b> Using Node Embeddings to Study Open Source Software Collaborations </b> </div>

<div style="font-size:18px" align="center"> <b> Brandon Kramer, UVA Biocomplexity Institute, OSS DSPG 2021 </b> </div>

<br>

### Setup  

In this notebook, we use `node2vec` to study open source software collaborations. First, let's load all of our modules and our node and edge data from the PostgreSQL database.

In [19]:
%matplotlib inline

# load modules 
import warnings
from datetime import datetime
#from text_unidecode import unidecode
from collections import deque
import os
import multiprocessing
import psycopg2 as pg
import pandas.io.sql as psql
import pandas as pd
from sklearn.manifold import TSNE
import numpy as np
import networkx as nx
from gensim.models import Word2Vec
from node2vec import Node2Vec
import altair as alt
warnings.filterwarnings('ignore')

# connect to the database, download data 
connection = pg.connect(host = 'postgis1', database = 'sdad', 
                        user = 'sz3wr', 
                        password = 'sz3wrsz3wr')

edgelist_data = '''SELECT ctr1, ctr2, repo_wts AS weight 
                   FROM gh_sna.sna_intl_ctr_edgelist_0819 LIMIT 50000'''
nodelist_data = '''select * from gh.sna_ctr_sectors '''

# run both queries and load the results into dataframes
edgelist_data = pd.read_sql_query(edgelist_data, con=connection)
nodelist_data = pd.read_sql_query(nodelist_data, con=connection)

# convert the edgelist to a graph 
graph = nx.from_pandas_edgelist(edgelist_data, source='ctr1', target='ctr2', edge_attr='weight')

print("Node count:", graph.number_of_nodes(), "- Edge count:", graph.number_of_edges())

Node count: 2830 - Edge count: 50000
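
To make the edgelist-to-graph step concrete, here is a minimal sketch on a toy edge list. The logins and weights are invented for illustration and do not come from the database; only the column names (`ctr1`, `ctr2`, `weight`) mirror the query above.

```python
import networkx as nx
import pandas as pd

# Hypothetical toy edge list with the same column names as the query above.
toy_edges = pd.DataFrame({
    "ctr1": ["alice", "alice", "bob", "carol"],
    "ctr2": ["bob", "carol", "carol", "dave"],
    "weight": [3, 1, 2, 5],
})

# Build an undirected graph; every edge keeps its `weight` attribute.
toy_graph = nx.from_pandas_edgelist(
    toy_edges, source="ctr1", target="ctr2", edge_attr="weight"
)

print("Node count:", toy_graph.number_of_nodes(),
      "- Edge count:", toy_graph.number_of_edges())
print("alice-bob weight:", toy_graph["alice"]["bob"]["weight"])
```

Because the graph is undirected, an edge listed as `alice, bob` is the same edge as `bob, alice`, so the edge count can be smaller than the row count if the edge list contains both orderings.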


In [None]:
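# TODO: figure out the environment issue so credentials are read from
# environment variables (os.environ.get returns None when a variable is unset)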
connection = pg.connect(host = 'postgis1', database = 'sdad', 
                        user = os.environ.get('db_user'), 
                        password = os.environ.get('db_pwd'))
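
One possible way to debug the environment issue is to fail early when a variable is missing, instead of passing `None` to the database driver. The helper below is a hypothetical sketch (the function name `get_db_credentials` is not part of the original notebook); it assumes the variables are called `db_user` and `db_pwd`, as in the cell above.

```python
import os


def get_db_credentials(user_var="db_user", pwd_var="db_pwd"):
    """Return (user, password) from the environment, or raise if either is unset."""
    user = os.environ.get(user_var)
    password = os.environ.get(pwd_var)
    # Collect the names of any variables that are unset or empty.
    missing = [name for name, value in ((user_var, user), (pwd_var, password))
               if not value]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return user, password


if __name__ == "__main__":
    try:
        user, password = get_db_credentials()
        print("Credentials found for user:", user)
    except RuntimeError as err:
        print(err)
```

The returned pair could then be passed to `pg.connect(..., user=user, password=password)`, which keeps the password out of the notebook itself.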

### Training node2vec

Next, we use `node2vec` to embed the network subset and save the model for later visualization.

In [20]:
# leave one core free, but always use at least one worker
cores_available = max(1, multiprocessing.cpu_count() - 1)

# train the graph with node2vec 
print("Started at:", datetime.now())
# node2vec
node2vec = Node2Vec(graph, dimensions=20, walk_length=16, num_walks=100, workers=cores_available)
# extract model
model = node2vec.fit(window=10, min_count=1)
print("Finished at:", datetime.now())

os.chdir('/project/biocomplexity/sdad/projects_data/ncses/oss/dspg_2021/')
model.save("node2vec_test.model")

# Running large graphs 
# https://github.com/eliorc/node2vec
# https://github.com/eliorc/node2vec/blob/master/example.py
#node2vec = Node2Vec(graph, dimensions=64, walk_length=30, num_walks=200, workers=4, temp_folder="/mnt/tmp_data")

Started at: 2021-07-22 16:43:03.969171


Computing transition probabilities:   0%|          | 0/2830 [00:00<?, ?it/s]

TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker.

The exit codes of the workers are {SIGKILL(-9)}
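
For intuition about what the training step does: `node2vec` samples random walks over the graph and passes them to Word2Vec as if they were sentences. The sketch below is a simplified illustration, not the library's implementation. It uses uniform, unweighted walks on a hypothetical toy adjacency list, whereas `node2vec` biases each step with its return and in-out parameters and can use edge weights.

```python
import random


def uniform_random_walks(adjacency, walk_length, num_walks, seed=0):
    """Sample `num_walks` uniform random walks starting from every node."""
    rng = random.Random(seed)
    walks = []
    for _ in range(num_walks):
        for start in adjacency:
            walk = [start]
            while len(walk) < walk_length:
                neighbors = adjacency[walk[-1]]
                if not neighbors:  # dead end: stop this walk early
                    break
                walk.append(rng.choice(neighbors))
            walks.append(walk)
    return walks


# Hypothetical toy graph stored as an adjacency list.
toy_adjacency = {
    "alice": ["bob", "carol"],
    "bob": ["alice", "carol"],
    "carol": ["alice", "bob", "dave"],
    "dave": ["carol"],
}

walks = uniform_random_walks(toy_adjacency, walk_length=5, num_walks=2)
print(len(walks), "walks; first walk:", walks[0])
```

The number of walks grows as `num_walks` times the number of nodes. With the settings in the cell above, that is 100 × 2,830 = 283,000 walks of length 16. Reducing `num_walks` or `workers` may be worth trying when workers are killed for excessive memory use.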

### Incorporating Node Attributes

Once we have the node embeddings (vectors in a multidimensional space), we can make some interesting visualizations. Before we do that, we take three steps. First, we run some basic network centrality measures (degree centrality, betweenness centrality, and PageRank) to supplement our visualizations. Second, we reduce the node embeddings to two dimensions using t-distributed stochastic neighbor embedding (`TSNE` from `scikit-learn`). Lastly, we join our node attributes (companies, countries, and sectors) to support our final analyses.

AttributeError: 'Word2Vec' object has no attribute 'vocab'

In [21]:
# import the node2vec model 
os.chdir('/project/biocomplexity/sdad/projects_data/ncses/oss/dspg_2021/')
model = Word2Vec.load("node2vec_test.model")

# run centrality measures 
deg_cent = nx.degree_centrality(graph)
btw_cent = nx.betweenness_centrality(graph, normalized = True, endpoints = False)
page_rank = nx.pagerank(graph, alpha = 0.8)
# turn each centrality dict into a dataframe keyed by login
deg_cent_df = pd.DataFrame(deg_cent.items(), columns=['login', 'deg_cent'])
btw_cent_df = pd.DataFrame(btw_cent.items(), columns=['login', 'btw_cent'])
page_rank_df = pd.DataFrame(page_rank.items(), columns=['login', 'page_rank'])
# combine the three measures into one table (one row per login)
cent_measures = (page_rank_df
                 .merge(btw_cent_df, on='login', how='left')
                 .merge(deg_cent_df, on='login', how='left'))
cent_measures


Unnamed: 0,login,page_rank,btw_cent,deg_cent
0,jstrachan,0.000569,3.749807e-07,0.001767
1,rawlingsj,0.000515,3.749807e-07,0.001767
2,hectorsector,0.000660,1.416594e-06,0.001767
3,hollenberry,0.000487,1.666581e-07,0.001414
4,crichID,0.000489,0.000000e+00,0.001060
...,...,...,...,...
2825,venilnoronha,0.000091,0.000000e+00,0.000707
2826,kulikala,0.000194,0.000000e+00,0.008484
2827,rriemann,0.000195,0.000000e+00,0.001060
2828,ebassi,0.000086,0.000000e+00,0.000353
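
As a sanity check on the `deg_cent` column, degree centrality as computed by `networkx` is a node's degree divided by the number of other nodes, `n - 1`. The sketch below reproduces that formula in plain Python on a hypothetical toy adjacency list; the names are invented for illustration.

```python
def degree_centrality(adjacency):
    """Degree divided by (n - 1), where n is the number of nodes."""
    n = len(adjacency)
    if n <= 1:
        # Degenerate case; returning 0.0 is a choice made for this sketch.
        return {node: 0.0 for node in adjacency}
    return {node: len(neighbors) / (n - 1)
            for node, neighbors in adjacency.items()}


# Hypothetical toy graph stored as an adjacency list.
toy_adjacency = {
    "alice": ["bob", "carol"],
    "bob": ["alice", "carol"],
    "carol": ["alice", "bob", "dave"],
    "dave": ["carol"],
}

for node, value in degree_centrality(toy_adjacency).items():
    print(node, round(value, 3))
```

Applied to the table above, with 2,830 nodes a `deg_cent` of 0.000353 corresponds to a single collaborator (1/2829), and 0.001767 corresponds to five (5/2829).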


In [31]:
# reduce the embeddings to two dimensions with t-SNE, then join all of the node attributes for data viz
vocab = list(model.wv.index_to_key)  # gensim >= 4.0: wv.index_to_key replaces the removed vocab attribute
model_x = model.wv[vocab]
model_tsne = TSNE(n_components=2)
model_tsne_x = model_tsne.fit_transform(model_x)
tsne_df = pd.DataFrame(model_tsne_x, index=vocab, columns=['x', 'y'])
tsne_df["login"] = vocab
tsne_df = tsne_df.join(nodelist_data.set_index('login'), on='login', how='left')
tsne_df = cent_measures.join(tsne_df.set_index('login'), on='login', how='left')
tsne_df

Unnamed: 0,login,page_rank,btw_cent,deg_cent,x,y,sector,city_info,cc_multiple,cc_di,cc_viz,raw_location,email,company_original,company_cleaned
0,jstrachan,0.000569,3.749807e-07,0.001767,10.506099,-50.607338,business,,gb,single,gb,"Mells, UK",james.strachan@gmail.com,CloudBees,cloudbees
1,rawlingsj,0.000515,3.749807e-07,0.001767,10.506712,-50.607876,business,cardiff,gb,single,gb,Cardiff,rawlingsj80@gmail.com,CloudBees,cloudbees
2,hectorsector,0.000660,1.416594e-06,0.001767,10.332358,-44.333717,null/missing,orlando,us,single,us,"Orlando, FL",,,
3,hollenberry,0.000487,1.666581e-07,0.001414,10.331035,-44.329735,business,san diego,us,single,us,"San Diego, CA",,GitHub,github
4,crichID,0.000489,0.000000e+00,0.001060,10.329580,-44.331402,business,,us,single,us,NC,,GitHub,github
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2825,venilnoronha,0.000091,0.000000e+00,0.000707,-24.056061,-17.559462,business,palo alto,us,single,us,"Palo Alto, California",hello@venilnoronha.io,@vmware,vmware
2826,kulikala,0.000194,0.000000e+00,0.008484,8.125417,22.840578,null/missing,tokyo,jp,single,jp,Tokyo/Japan,,,
2827,rriemann,0.000195,0.000000e+00,0.001060,-39.847633,-3.097565,null/missing,lyon,fr,single,fr,Lyon,robert@riemann.cc,,
2828,ebassi,0.000086,0.000000e+00,0.000353,14.195987,-32.491650,null/missing,london,gb,single,gb,"London, UK",ebassi@gmail.com,,


### Node Embedding Visualizations 

First, we can visualize these embeddings by sector.

In [32]:
domain = ['academic', 'business', 'non-profit', 'government']# , 'not classified', 'null/missing']
range_ = ['crimson', 'teal', 'darkorange', 'darkblue'] #, 'lightgrey', 'lightgrey']

alt.Chart(tsne_df,title="Node Embedding of OSS Collaboration Networks (by Sector)").mark_circle().encode(
   x='x', y='y', 
    color=alt.Color('sector', scale=alt.Scale(domain=domain, range=range_)),
    size=alt.Size('page_rank'),
    tooltip=['login', 'sector', 'city_info', 'cc_viz', 'company_original', 'company_cleaned']
).interactive().properties(
    width=700,
    height=500
)

In [33]:
domain = ['microsoft', 'google', 'red hat', 'ibm', 'facebook', 'intel', 'thoughtworks', 'alibaba', 'amazon', 'databricks']

alt.Chart(tsne_df, title="Node Embedding of OSS Collaboration Networks (Top Companies)").mark_circle().encode(
   x='x', y='y', 
    color=alt.Color('company_cleaned', scale=alt.Scale(domain=domain)),
    size=alt.Size('page_rank'),
    tooltip=['login', 'sector', 'city_info', 'cc_viz', 'company_original', 'company_cleaned']
).interactive().properties(
    width=700,
    height=500
)

In [34]:
alt.Chart(tsne_df, title="Node Embedding of OSS Collaboration Networks (by Company)").mark_circle(size=150).encode(
   x='x', y='y', 
    color='company_cleaned',
    size=alt.Size('page_rank'),
    tooltip=['login', 'sector', 'city_info', 'cc_viz', 'company_original', 'company_cleaned']
).interactive().properties(
    width=700,
    height=500
)

In [35]:
alt.Chart(tsne_df,title="Node Embedding of OSS Collaboration Networks (by Country)").mark_circle(size=150).encode(
   x='x', y='y', 
    color='cc_viz',
    size=alt.Size('page_rank'),
    tooltip=['login', 'sector', 'city_info', 'cc_viz', 'company_original', 'company_cleaned']
).interactive().properties(
    width=700,
    height=500
)

In [4]:
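# gensim >= 4.0: similarity queries live on the KeyedVectors object (model.wv)
for node, _ in model.wv.most_similar('rasbt'):
    # Show only logins longer than three characters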
    if len(node) > 3:
        print(node)

buaaliyi
rinugun
yuanbyu
EronWright
charlesnicholson
sugyan
darrengarvey
markmcd
zafartahirov
vahidk
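
The nearest-neighbor query above ranks nodes by the cosine similarity between their embedding vectors. The sketch below shows that ranking in plain Python on hypothetical two-dimensional embeddings; the names and vectors are invented for illustration, and the real model uses 20 dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(embeddings, query, topn=3):
    """Rank every other node by cosine similarity to `query`, highest first."""
    scores = [(name, cosine_similarity(embeddings[query], vector))
              for name, vector in embeddings.items() if name != query]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Hypothetical two-dimensional embeddings.
toy_embeddings = {
    "alice": [1.0, 0.0],
    "bob": [0.9, 0.1],
    "carol": [0.0, 1.0],
    "dave": [-1.0, 0.0],
}

for name, score in most_similar(toy_embeddings, "alice"):
    print(name, round(score, 3))
```

Nodes pointing in nearly the same direction as the query score close to 1, orthogonal nodes score 0, and nodes pointing the opposite way score close to -1.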


### Future directions

Other graph embedding implementations to explore: [GraphEmbedding](https://github.com/shenweichen/GraphEmbedding), in particular `node2vec` and `struc2vec`.

[Node2Vec Tutorial](https://github.com/eliorc/Medium/blob/master/Nod2Vec-FIFA17-Example.ipynb)
