# OSS Collaboration Graph Platform
### by Agroskin Alexander

Uncomment the following lines to install required libraries:

In [1]:
# !pip install pyvis pandas
# !pip install requests  # necessary only if you are planning on updating the files

I've already saved the necessary data to the `authors.json` file. Uncomment the following cell if you want to request updated contributor data through the GitHub API. An OAuth token is advised to avoid hitting the rate limit, although the function will work without it.

In [2]:
# from preparation import prepare_data

# prepare_data("your_token_here")

Next we will read the data from `authors.json` and create a `pandas` `DataFrame` from it:

In [3]:
import pandas as pd
import json

with open('authors.json', 'r') as au:
    authors = json.load(au)
data = pd.DataFrame.from_dict(authors, orient='index')
data['files'] = data['files'].map(set)
data['total_time'] = data['maxdate'] - data['mindate']
data.head()

| | total | files | maxdate | mindate | total_time |
|---|---|---|---|---|---|
| Dan Abramov | 1508 | {scripts/tasks/flow-ci.js, packages/react-dom/... | 1617117597 | 1401056969 | 216060628 |
| Brian Vaughn | 1328 | {packages/react-devtools-extensions/shared/ico... | 1618009933 | 1481141166 | 136868767 |
| Sophie Alpert | 875 | {src/renderers/shared/stack/reconciler/instant... | 1593515345 | 1504735814 | 88779531 |
| Andrew Clark | 857 | {scripts/tasks/flow-ci.js, scripts/release/REA... | 1617932735 | 1446750778 | 171181957 |
| Paul O’Shannessy | 821 | {React/core/__tests__/ReactIdentity-test.js, R... | 1567119887 | 1369856771 | 197263116 |


The resulting `DataFrame` has the following columns:
* `total`: total number of commits to the project
* `files`: set of files that the user has changed
* `maxdate`: date of the last commit of this user (in UNIX time format) 
* `mindate`: date of the first commit of this user (in UNIX time format) 
* `total_time`: time (in seconds) between the first and the last commit of this user
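
To make the UNIX timestamps easier to read, they can be converted to calendar dates. The snippet below is a minimal sketch using only the standard library; `describe_activity` is a hypothetical helper that is not part of the notebook, and the two timestamps are taken from the first row of the table above.

```python
from datetime import datetime, timezone


def describe_activity(mindate, maxdate):
    """Convert two UNIX timestamps to UTC datetimes and return the span between them.

    Hypothetical helper for illustration only; not part of the notebook.
    """
    first = datetime.fromtimestamp(mindate, tz=timezone.utc)
    last = datetime.fromtimestamp(maxdate, tz=timezone.utc)
    return first, last, last - first


# mindate and maxdate of the first row in the table above
first, last, span = describe_activity(1401056969, 1617117597)
print(first.date(), last.date(), span.days)
```

The `span` here corresponds to the `total_time` column, expressed as a `timedelta` instead of a raw number of seconds.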


Now we will build a graph network using our `DataFrame`. We connect two contributors with an edge if the number of files they have both changed is above average. The length of each edge is inversely proportional to the number of common files multiplied by the length of the time period the two contributors have spent on the project together (I consider "time spent on project" to be the time period between the first and the last commit).

The resulting equation for the length between nodes (contributors) $a$ and $b$ is as follows:
$$ len(a, b)=\frac{1}{(inter_{a,b} - mean_{inter})\cdot(1 + (time_{a,b} - shift_{time})\cdot 10^{-8})}$$

Where:
* $inter_{a,b}$ is the number of files both contributors have worked on
* $mean_{inter}$ is the average number of overlapping files
* $time_{a,b}$ is the overlap of the activity periods of $a$ and $b$ (in seconds)
* $shift_{time}$ is the assumed average overlap time, which I've decided to set to two years (again in seconds)
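
As a worked example of the formula above, here is a minimal standalone sketch. The names `overlap_seconds` and `edge_length` are hypothetical; the constants (a mean intersection of 15 files and a two-year shift) are the ones used in the notebook's code cell below, and returning `0.0` for "no edge" mirrors that cell's convention.

```python
TWO_YEARS = 2 * 31536000  # shift_time: two years, in seconds
MEAN_INTER = 15           # mean_inter: value hard-coded in the notebook


def overlap_seconds(a_start, a_end, b_start, b_end):
    """Length of the intersection of [a_start, a_end] and [b_start, b_end], or 0 if disjoint."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))


def edge_length(inter, time_overlap):
    """len(a, b) from the formula; returns 0.0 when the file overlap is not above average."""
    excess = inter - MEAN_INTER
    if excess <= 0:
        return 0.0
    return 1 / (excess * (1 + (time_overlap - TWO_YEARS) * 1e-8))


# 25 shared files and exactly two years together: 1 / (10 * 1) = 0.1
print(edge_length(25, TWO_YEARS))
# Same file overlap but no shared time: the denominator shrinks, so the edge gets longer
print(edge_length(25, 0))
```

More shared files or a longer shared period both enlarge the denominator, so closely collaborating contributors end up nearer to each other in the graph.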

Additionally, the size of each node is proportional to the number of contributions and the total time spent on the project:
$$size(a)=total_a \cdot total\_time_a \cdot 10^{-8}$$
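
To give a feel for the scale of these values, the following sketch evaluates the size formula for the first two rows of the table above. `node_size` is a hypothetical helper name; the notebook computes the same quantity directly on the `DataFrame` columns.

```python
def node_size(total, total_time):
    """size(a) = total_a * total_time_a * 1e-8 (hypothetical helper, for illustration)."""
    return total * total_time * 1e-8


# (total, total_time) pairs taken from the first two rows of the table above
dan = node_size(1508, 216060628)
brian = node_size(1328, 136868767)
print(round(dan, 2), round(brian, 2))
```

Because both factors enter as a product, a contributor with many commits over a long period dominates the picture much more than either factor alone would suggest.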

In the end, all formulae and metrics are rather arbitrary, but the resulting picture displays the relationship between contributors relatively clearly.

For visualization I've chosen the `pyvis` library, as it allows embedding an interactive graph viewer. If your Jupyter viewer does not support embedding of `.html` files, you can turn off notebook mode (or just open `react_graph.html`) and the network will be shown in your browser.

In [4]:
import math
from itertools import combinations
from pyvis.network import Network


# Computes the edge length between two contributors (0 means "no edge")
def get_length(first, second) -> float:
    two_years = 31536000 * 2  # I'm using two years (in seconds) as an average time spent working together
    mean_intersection = 15  # calculated based on the average number of file intersections
    inter_len = len(first.files & second.files) - mean_intersection
    if inter_len <= 0:
        return 0.0  # file overlap is not above average, so no edge is drawn
    # Overlap of the two activity periods [mindate, maxdate], in seconds (0 if they are disjoint)
    together = max(0, min(first.maxdate, second.maxdate) - max(first.mindate, second.mindate))
    return 1 / (inter_len * (1 + (together - two_years) / 100000000))


# Set notebook=False if you want to open the graph viewer externally
net = Network(notebook=True, width=1000, height=800, font_color=True)
# Node size follows the size(a) formula; one color entry per contributor instead of a hard-coded 50
net.add_nodes(data.index, value=(data['total'] * data['total_time'] / 100000000).tolist(),
              color=['#1520a6'] * len(data))
for first, second in combinations(data.index, 2):
    length = get_length(data.loc[first], data.loc[second])
    if length > 0:  # Adding an edge only if the file overlap is above average
        net.add_edge(first, second, length=length, color='orange')
net.barnes_hut(gravity=-30000, central_gravity=1.5, spring_strength=0.002)
net.show('react_graph.html')
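
The value `mean_intersection = 15` is hard-coded in `get_length`, and the notebook does not show how it was derived. One plausible way to recompute it, assuming it is the plain mean of the file overlap over all contributor pairs (pairs with no common files included), is sketched below. `mean_file_overlap` and the toy `files_by_author` mapping are hypothetical and use made-up file names.

```python
from itertools import combinations
from statistics import mean


def mean_file_overlap(files_by_author):
    """Average number of shared files over all unordered pairs of authors.

    Assumption: every pair counts, including pairs with an empty intersection.
    """
    overlaps = [
        len(files_by_author[a] & files_by_author[b])
        for a, b in combinations(files_by_author, 2)
    ]
    return mean(overlaps) if overlaps else 0.0


# Toy data with made-up names; in the notebook this would be data['files'].to_dict()
files_by_author = {
    "alice": {"a.js", "b.js", "c.js"},
    "bob": {"b.js", "c.js", "d.js"},
    "carol": {"x.js"},
}
print(mean_file_overlap(files_by_author))
```

If only pairs with at least one common file were meant to count, the list comprehension would need an extra filter, and the resulting threshold would be higher.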