# Table of Contents
* [Network characteristics](#Network-characteristics)


### Network characteristics

The network of verdicts has the following characteristics:

1. It is directed, since each link points from a citing verdict to the verdict it references
2. It has a simple temporal structure where each verdict can only reference back in time
3. Once established, a node is static and cannot change based on new nodes being added to the network
4. It is acyclic, since two verdicts referencing each other would contradict points 1 and 2

If we assume that the likelihood of a link being made is independent of the distance in time, the only thing needed to take the temporal structure into account is to prune the network to the timestamp of the link being validated. This leaves a directed acyclic graph (DAG) in which the newest node is guaranteed to have only outgoing links.
<figure style="display: inline-block; float: right; text-align: center; max-width: 50%; margin: 0 20px">
<img src="Pictures/DAG.png" style="float: right; max-width: 100%; margin-bottom: 10px">
<figcaption style="font-style: italic">Possible triadic configurations for a directed, acyclic graph where $x$ and $y$ are the endpoints of the possible link being investigated. A and B are the only possibilities if $x$ cannot have incoming links.</figcaption>
</figure>
Any three nodes in a general DAG can be configured in one of four ways, and the additional constraint that the newest node only has outgoing links reduces the possible triadic patterns to two, as seen in the figure on the right. A more general directed network that allowed for reciprocal links between nodes would have a total of 36 different triadic configurations.
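The pruning step can be sketched in a few lines. This is only an illustration: it assumes the network is stored as a set of `(citing, cited)` tuples and that a hypothetical `timestamp` dictionary maps each verdict to its year, neither of which is prescribed by the text.

```python
def prune_to_timestamp(edges, timestamp, t):
    """Keep only the nodes that existed at time t and the links between them."""
    alive = {node for node, ts in timestamp.items() if ts <= t}
    return {(u, v) for u, v in edges if u in alive and v in alive}


# Toy network (made-up data): every verdict cites only older verdicts.
timestamp = {"a": 1990, "b": 1995, "c": 2001, "d": 2010}
edges = {("b", "a"), ("c", "a"), ("c", "b"), ("d", "c")}

pruned = prune_to_timestamp(edges, timestamp, 2001)

# After pruning, the newest node "c" has outgoing links only.
incoming_to_newest = [e for e in pruned if e[1] == "c"]
```

After pruning to 2001 the later verdict `d` and its citation of `c` are gone, so `c` is the newest node and has no incoming links.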

### Link prediction methods in directed networks
Most methods presented for link prediction have focused on undirected networks, where the predictor is some variant of the common neighbours formulation with some weighting scheme. These have proved to be efficient in many real networks and even perform well in directed networks when discarding the directedness [](#cite-lu). **Show the entire unsupervised framework here?**
Unsupervised methods for directed networks also exist. In general they try to exploit the richer structural information found in directed networks to make better predictions, either by defining specific in- and out-networks of interest or by looking at the likelihood of the triadic configuration that would be created by a predicted link.

## Features and similarity indices

### Node distance
A useful feature when estimating the likelihood of a node pair creating a link is the shortest distance between the two nodes, which other authors have shown to be useful in different networks. On its own it might not be particularly informative, but it can add value as a feature in a supervised model where it is combined with other topological, temporal or meta-features.
That it can be useful in the verdict network can be seen in the figure below, which shows the distribution of shortest path distances for connected and non-connected nodes in the network.

<figure style="display: inline-block; text-align: center; max-width: 100%; margin: 0 20px">
<img src="Pictures/shortest path dist.png" style="max-width: 85%; margin-bottom: 10px">
<figcaption style="font-style: italic">Distribution of shortest paths between connected and non-connected nodes. Graphs are normalized to sample size, so this graph does not take into account class imbalance.</figcaption>
</figure>
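A minimal sketch of the distance feature is given here. It assumes the network is a set of `(citing, cited)` tuples and that link direction is ignored when measuring distance; the text does not state which convention was used, so both choices are assumptions.

```python
from collections import deque


def shortest_distance(edges, source, target):
    """Hop count between two nodes ignoring direction; None if unreachable."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)

    seen = {source}
    queue = deque([(source, 0)])
    while queue:  # plain breadth-first search
        node, dist = queue.popleft()
        if node == target:
            return dist
        for neighbor in adjacency.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None
```

For pairs that are already linked the distance is trivially 1, so when this is used as a feature for existing edges the edge under evaluation would have to be removed first.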

### Common Neighbors

Common neighbors, or the intersection of the 1-neighborhoods of the nodes $x$ and $y$, is the most commonly used neighborhood in link prediction. It has been shown that the likelihood of a link between two nodes correlates highly with the distance between the nodes as well as with the number of neighbors they share (Link prediction, fair evaluation) (Lu, survey).
Formally the score is found by

$$
s_{x,y} = | \Gamma(x) \cap \Gamma(y) |
$$

where $\Gamma(x)$ is the function returning the 1-neighborhood of $x$. For a directed graph either out- or in-neighborhoods can be used, and the choice depends on the network. In this case, where $x$ will always be a newly inserted node with no incoming links, both the in- and out-neighbors of $y$ are used, as seen in the figure.
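As an illustration of this choice of neighborhoods, the sketch below takes the out-neighbors of the new node $x$ and both the in- and out-neighbors of $y$. The edge-set representation and the function name are assumptions made for the example.

```python
def common_neighbors(edges, x, y):
    """Common neighbors of a new node x (outgoing links only) and an older node y."""
    out_x = {v for u, v in edges if u == x}
    neighbors_y = {v for u, v in edges if u == y} | {u for u, v in edges if v == y}
    return out_x & neighbors_y


# Toy example: x cites a, b and c; y cites a and is cited by c.
edges = {("x", "a"), ("x", "b"), ("x", "c"), ("y", "a"), ("c", "y")}
score = len(common_neighbors(edges, "x", "y"))
```

The score is the size of the returned set, which is 2 in the toy example (`a` and `c`).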

The usefulness of this index can be seen in the figure, where the distribution of common neighbors for edges and a sub-sample of non-edges is shown. Real edges are far more likely to have common neighbors than non-edges. However, there is still a group of edges that do not share any common neighbors, which effectively creates an upper limit on how many edges can be predicted with this index.

<figure style="display: inline-block; text-align: center; max-width: 100%; margin: 0 auto">
<img src="Pictures/Common neighbors distribution.png" style="margin: 0 auto; max-width: 85%; margin-bottom: 10px">
<figcaption style="font-style: italic">Distribution of common neighbors for all edges and subsample of non-edges of size 43694.</figcaption>
</figure>

### Triadic closeness
In a network a triad is a combination of three nodes, $x$, $z$ and $y$, along with the possible links between them. Every combination is a specific type of triad. For a directed graph there are 16 different types, but this increases to 36 when we consider that $z$ is a common neighbor of $x$ and $y$ and that the order matters for link prediction: the triad consisting of the links $\{(x,y), \ (x,z), \ (y,z)\}$ is not the same as $\{(y,x), \ (x,z), \ (y,z)\}$. The different combinations are seen in the figure, where they are classified based on the link between $x$ and $y$: 1-9 mean no link, 11-19 a link from $x$ to $y$, 21-29 a link from $y$ to $x$ and 31-39 a reciprocal link between $x$ and $y$.

<figure style="display: inline-block; text-align: center; max-width: 100%; margin: 0 auto">
<img src="Pictures/open triads.png" style="margin: 0 auto; max-width: 85%; margin-bottom: 10px">
<figcaption style="font-style: italic">Possible open triadic configurations for a directed graph, with the category label below the triad.</figcaption>
</figure>
<figure style="display: inline-block; float: right; text-align: center; max-width: 45%; margin: 0 20px">
<img src="Pictures/closed triads.png" style="float: right; max-width: 100%; margin-bottom: 10px">
<figcaption style="font-style: italic">Possible closed triadic configurations for a directed graph, with the category label below the triad.</figcaption>
</figure>
Triadic closeness is an unsupervised method proposed by Schall [](#cite-schallTC) that exploits the fact that some triadic patterns are more likely to appear in a given network than others. If there is a possible link between nodes $x$ and $y$ that have the common neighbor $z$, the likelihood of the created triad appearing in the network can be used as a score. In other words, the likelihood of a link between $x$ and $y$ depends on how common the triad formed by adding the link is compared to how common the triad formed by the unconnected $x$ and $y$ is.

This is related to [motif analysis](https://en.wikipedia.org/wiki/Network_motif), specifically motif analysis where every motif is a sub-graph of size 3. Schall [](#cite-schallTC) produced good results on data from Twitter, GitHub and Google+.

The actual triadic closeness score is calculated as

$$
s_{x,y} = \sum_{z \ \in \Gamma(x) \cap \Gamma(y)} \frac{\text{F}(\delta(x,y,z) + 10) + \text{F}(\delta(x,y,z) + 30)}{\text{F}(\delta(x,y,z))}
$$

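where $s_{x,y}$ is the score for the $x,y$ node pair, $\Gamma(x)$ returns the neighborhood of $x$, $\delta(x,y,z)$ returns the triad formed by the nodes $x$, $y$ and $z$, and $\text{F}(t)$ returns the frequency of the triad $t$. Adding 10 or 30 to the ID of an open triad gives the same triad with a link from $x$ to $y$ or a reciprocal link added, following the labelling above. Note that the neighborhood consists of both in- and out-going links.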

Since a triad requires that the nodes $x$ and $y$ are joined through an extra node $z$, nodes that are linked but do not share neighbors will not be found by this method. If the network being examined has important structure beyond the 1-neighborhood, more general motif analysis methods can be used.

### Common referrers
<figure style="display: inline-block; float: right; text-align: center; max-width: 33%; margin: 0 20px">
<img src="Pictures/Common referrers.png" style="float: right; max-width: 100%; margin-bottom: 10px">
<figcaption style="font-style: italic">Common referrer neighborhood for a possible connection between $x$ and $y$ where the common referrers are the $z$-nodes marked in red.</figcaption>
</figure>
Common referrers is an expansion of the common neighbors metric which assumes that if $x$ is connected to $z$ through $u$, then $x$ is also more likely to connect to other nodes that $z$ is connected to. For the verdict citation network this indicates that verdicts that have the same citations are similar and that similar verdicts tend to have the same citations.
If a possible link between the node pair $(x,y)$ is being evaluated then the common referrers index takes the following form

$$
s_{x,y} = \Gamma_{in}(y) \cap \Gamma_{in}(u) \ \forall \ u \ \in \ \Gamma_{out}(x) 
$$

The neighborhood is also shown in figure 4 where the common referrers are the $z$-nodes marked in red for a potential connection between $x$ and $y$.
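A possible reading of the index is sketched below. It assumes that the sets obtained for each $u$ are merged into one neighborhood (a union) and that the score is the size of that neighborhood; the formula leaves this open, so treat it as one interpretation rather than the definitive one. The edge-set representation is also an assumption.

```python
def common_referrers(edges, x, y):
    """Nodes z that cite y and also cite at least one verdict u cited by x."""
    cited_by = {}  # node -> set of nodes citing it (in-neighborhood)
    for u, v in edges:
        cited_by.setdefault(v, set()).add(u)

    out_x = {v for u, v in edges if u == x}
    in_y = cited_by.get(y, set())

    referrers = set()
    for u in out_x - {y}:  # y itself is skipped; candidates are assumed unlinked
        referrers |= in_y & cited_by.get(u, set())
    return referrers - {x}


# Toy example: z1 and z2 each share a citation with x and also cite y.
edges = {
    ("x", "u1"), ("x", "u2"),
    ("z1", "u1"), ("z1", "y"),
    ("z2", "u2"), ("z2", "y"),
    ("z3", "y"), ("z4", "u1"),
}
score = len(common_referrers(edges, "x", "y"))
```

Here `z3` cites `y` but shares no citation with `x`, and `z4` shares a citation with `x` but does not cite `y`, so neither is counted.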

## Weighting schemes

The scores from the presented neighborhoods can be used as is: for common neighbors and common referrers the score is simply the number of nodes in the returned neighborhood, and for triadic closeness it is the ratio of potential triad counts to the counts of the current triad.

In many cases it has, however, proven useful to weigh the scores by some function, for instance dividing the size of the returned neighborhood by the size of the potential returned neighborhood (Jaccard) or weighing the returned nodes by the inverse of their degree (RA). In general these weighting schemes can be applied to any of the returned neighborhoods and involve either multiplying the cardinality of the set by some factor or replacing the raw node counts with a weighted sum. What they have in common is that they are all based on structural features, so no meta information about the nodes is needed.

<figure style="display: inline-block; float: right; text-align: center; max-width: 50%; margin: 0 20px">
<img src="Pictures/clust.png" style="float: right; max-width: 100%; margin-bottom: 10px">
<figcaption style="font-style: italic">Clustering coefficient as a function of node degree. Indicates that two nodes sharing a high degree neighbor are less likely to share a connection than two nodes sharing a low degree neighbor.</figcaption>
</figure>
An indication of the usefulness of weighting results can be seen in the figure to the side, where the clustering coefficient of nodes can be seen as a function of degree. The decreasing coefficient indicates that two nodes sharing a neighbor with a high degree are less likely to create a connection.

The different weighting schemes are presented below, where $\Gamma$ is the neighborhood function and $k_z$ is the degree of node $z$ (for instance the in-degree $k_{in, z}$ when only incoming links are counted).

#### Jaccard
The Jaccard index weighs the number of common neighbors of $x$ and $y$ by their total number of potential common neighbors, making connections between high degree nodes less likely than connections between low degree nodes (Survey paper).
$s_{x,y} = \frac{|\Gamma(x) \cap \Gamma(y)|}{|\Gamma(x) \cup \Gamma(y)|}$

#### Adamic / Adar

Based on the idea that shared neighbors with a high degree are less significant than shared neighbors with a low degree (Friends and neighbors on the web).
$s_{x,y} = \sum_{z \ \in \ \Gamma(x) \cap \Gamma(y)} \frac{1}{\log k_z}$

#### Resource Allocation

Resource allocation is similar to Adamic/Adar except that it does not take the log of the degree of the neighbor node and in that way penalizes high degree nodes more. It is based on the concept of information flow in a network, where the probability of a signal reaching node $y$ from node $x$ through $z$ is inversely proportional to the degree of $z$ (Lu survey source 51).

$s_{x,y} = \sum_{ z \ \in \ \Gamma(x) \cap \Gamma(y)} \frac{1}{k_z}$

#### Leicht-Holme-Newman

Leicht, Holme and Newman proposed this index in (Lu survey source 38) as a method for scoring node pairs that have a high number of common neighbors compared to their expected number of common neighbors, which is proportional to the product of $k_x$ and $k_y$.

$s_{x,y} = \frac{|\Gamma(x) \cap \Gamma(y)|}{k_x \times k_y}$

#### Hub Depressed Similarity

$s_{x,y} = \frac{|\Gamma(x) \cap \Gamma(y)|}{\max(k_x, k_y)}$

#### Hub Promoted Similarity

$s_{x,y} = \frac{|\Gamma(x) \cap \Gamma(y)|}{\min(k_x, k_y)}$
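The six weighting schemes can be compared side by side with a small sketch. It assumes undirected 1-neighborhoods built from an edge set, takes $k$ to be the size of that neighborhood, and assumes there are no self-loops; the function names are made up for the example.

```python
import math


def neighborhoods(edges):
    """Undirected 1-neighborhoods built from (u, v) edge tuples."""
    nbrs = {}
    for u, v in edges:
        nbrs.setdefault(u, set()).add(v)
        nbrs.setdefault(v, set()).add(u)
    return nbrs


def weighted_scores(edges, x, y):
    """Return all six weighted common-neighbor scores for the pair (x, y)."""
    nbrs = neighborhoods(edges)
    gx, gy = nbrs.get(x, set()), nbrs.get(y, set())
    common, union = gx & gy, gx | gy
    kx, ky, n = len(gx), len(gy), len(gx & gy)
    return {
        "jaccard": n / len(union) if union else 0.0,
        # Every common neighbor z touches both x and y, so k_z >= 2 and log k_z > 0.
        "adamic_adar": sum(1.0 / math.log(len(nbrs[z])) for z in common),
        "resource_allocation": sum(1.0 / len(nbrs[z]) for z in common),
        "leicht_holme_newman": n / (kx * ky) if kx and ky else 0.0,
        "hub_depressed": n / max(kx, ky) if max(kx, ky) else 0.0,
        "hub_promoted": n / min(kx, ky) if min(kx, ky) else 0.0,
    }
```

The same function could be applied to any of the neighborhoods above by swapping out how `gx` and `gy` are built.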

### Temporal features

As link prediction has evolved to work on more and more complex and dynamic networks, it has become necessary to devise methods that can capture other dimensions than only the static structure used so far. For networks that evolve over time this includes looking at repeating patterns in the activity between nodes (Temporal Link Prediction and Tensor Factorizations), time-series based methods of predicting network features in the future (Time Series Based Link Prediction), burstiness of incoming links and link probabilities that decay over time (Time aware index for link prediction in social networks).

Unlike social networks, the verdict citation network is semi-static, since a node cannot add more outgoing links once it has been inserted and a link cannot be removed once it has been created. This limits the usefulness of many temporal methods studied in the literature, leaving deprecating likelihoods and burstiness as possible features.

#### Deprecating likelihoods
The intuition behind deprecating likelihoods is that a verdict plausibly makes references that are close in time, as older verdicts might be less relevant or further from memory for the judge making the verdict. The histograms below show the distributions for different years, where it is easy to see that links that are short in time are preferred over longer links.
This can be used to create a per-year probability distribution and to weigh all potential links by that probability.
Since it is technically not possible to use the distribution for the current year for new nodes, the distribution for the preceding year is used, i.e. for a node inserted at time $t$ the distribution for $t-1$ is used. That this is a decent approximation can be validated by looking at the correlation between the different time distributions, where the average for each year is found to be 0.82, with all newer correlations higher than 0.9.

<figure style="display: inline-block; text-align: center; max-width: 100%; margin: 0 auto">
<img src="Pictures/out link time dist.png" style="margin: 0 auto; max-width: 85%; margin-bottom: 10px">
<figcaption style="font-style: italic">Distribution of years linked to for 2013, 2003, 1993 and 1983</figcaption>
</figure>
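One way to turn these distributions into a weight is sketched below. It assumes the distribution is taken over the age of a citation (citing year minus cited year) rather than over absolute years, and that a hypothetical `year` dictionary gives the year of each verdict; both are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def age_distributions(edges, year):
    """Per citing year, the relative frequency of each citation age in years."""
    counts = defaultdict(Counter)
    for citing, cited in edges:
        counts[year[citing]][year[citing] - year[cited]] += 1
    return {
        yr: {age: n / sum(ages.values()) for age, n in ages.items()}
        for yr, ages in counts.items()
    }


def temporal_weight(distributions, t, candidate_year):
    """Weight for a link from a node inserted at t, using the distribution of t - 1."""
    previous = distributions.get(t - 1, {})
    return previous.get(t - candidate_year, 0.0)


# Toy data (made up): two verdicts from 2010 cite verdicts from 2000 and 2005.
year = {"a": 2000, "b": 2005, "c": 2010, "d": 2010}
edges = [("c", "a"), ("c", "b"), ("d", "b"), ("b", "a")]
distributions = age_distributions(edges, year)
weight = temporal_weight(distributions, 2011, 2006)  # candidate is 5 years old
```

The weight could then be multiplied onto any of the structural scores for the same candidate link.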

#### Burstiness
The idea of burstiness is that many natural phenomena tend to appear in bursts: many events in rapid succession followed by longer pauses with no activity. This can be seen in, for instance, earthquakes, where aftershocks often follow larger shakes, in crime, where a neighborhood or a single house is more likely to be victimized after having been robbed once (Numberphile), or in neurology, where neurons exchange signals in spike trains, short bursts of activity followed by longer pauses. Mathematically this means that the process is not a pure Poisson process, as the interarrival times of events are not independent, and it is instead described by a XXX process, a type of self-excited Poisson process.
<figure style="display: inline-block; float: right; text-align: center; max-width: 50%; margin: 0 auto">
<img src="Pictures/interarrival times histogram.png" style="margin: 0 auto; max-width: 100%; margin-bottom: 10px">
<figcaption style="font-style: italic">Distribution of interarrival times for incoming links, cut-off at 2000</figcaption>
</figure>
To determine if the interarrival times of incoming links exhibit burstiness, the interarrival time for each link is calculated and shown in the histogram to the side. A bursty process will usually have a bi-modal distribution, with one peak at the short interarrival times within bursts and another at the longer interburst times (Cross Validated answer). This behaviour is not seen in the histogram, which seems closer to the exponential distribution of interarrival times in a Poisson process. Additionally the coefficient of variation, $\frac{\sigma}{\mu}$, of the interarrival times is 1.79, which is fairly close to the 1.0 expected of a Poisson process (Cross validated answer).
With these two results it seems safe to assume that using burstiness as an index will not improve results in the link prediction.
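The coefficient of variation is straightforward to compute once the arrival times of the incoming links of a node are known. The sketch below assumes the population standard deviation is used and that arrival times are plain numbers; the text does not specify either detail.

```python
import statistics


def interarrival_cv(arrival_times):
    """Coefficient of variation (sigma / mu) of the gaps between sorted arrival times."""
    times = sorted(arrival_times)
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    # Needs at least two arrivals and a non-zero mean gap.
    return statistics.pstdev(gaps) / statistics.mean(gaps)


# Perfectly regular arrivals give 0; more irregular arrivals give larger values.
regular = interarrival_cv([0, 2, 4, 6])
irregular = interarrival_cv([0, 1, 4])
```

Values well above 1 would point towards bursty behaviour, while a Poisson process is expected to give a value close to 1.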

### Evaluation methods

Link prediction is a very imbalanced classification problem, meaning that there are far more node pairs that do not share a link than pairs that do. This can lead to hard-to-interpret or misleading results (Link prediction, fair evaluation), especially when using the traditional area under the receiver operating characteristic (ROC) curve for evaluating links. A better alternative is the precision-recall curve (PRC): a method dominates another in ROC space if and only if it also dominates in precision-recall space (The relationship between precision-recall and ROC), while the PRC focuses on performance on the rare positive class.

Another issue is that the number of potential edges is $V^2$, where $V$ is the number of vertices in the network, and the ratio of existing edges to non-existing edges, $\frac{E}{V^2-E}$, is usually very small, resulting in very high computation times. Undersampling the non-existing edges leads to erroneous precision results, so instead the existing edges are undersampled and scores are only calculated for non-existing edges that will give a score higher than 0. This results in lower computation time without distorting the method's precision, while still following the guidelines set out by Lichtenwalter and Chawla (Link prediction, fair evaluation) for evaluation of link prediction methods.

For the network of verdicts a reasonable application is a recommender system, so an evaluation method that mimics this use-case is preferable. We choose a node-centric approach where nodes of a specific degree, $k$, are chosen to create a test set, $V_{test}$. For each node in the test set we remove all but one edge from the node and attempt to predict the $k-1$ removed edges. Precision is then calculated based on the results of these predictions.
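The node-centric procedure can be sketched for a single test node as follows. The details are assumptions: candidates are all other nodes, the top $k-1$ ranked candidates are taken as predictions, and `cn_score` is a simple common-neighbors scorer used only to make the example runnable.

```python
def cn_score(edges, x, y):
    """Illustrative scorer: out-neighbors of x shared with any neighbor of y."""
    out_x = {v for u, v in edges if u == x}
    nbrs_y = {v for u, v in edges if u == y} | {u for u, v in edges if v == y}
    return len(out_x & nbrs_y)


def evaluate_node(edges, node, keep, score):
    """Hide every out-link of `node` except the one to `keep`, then try to recover them."""
    hidden = {v for u, v in edges if u == node} - {keep}
    if not hidden:
        return None  # nothing to predict for this node
    train = {(u, v) for u, v in edges if not (u == node and v in hidden)}
    candidates = {n for edge in edges for n in edge} - {node, keep}
    # Rank by descending score; ties are broken by name to keep the result stable.
    ranked = sorted(candidates, key=lambda c: (-score(train, node, c), str(c)))
    predicted = set(ranked[:len(hidden)])
    return len(predicted & hidden) / len(hidden)


# Toy network (made up): x cites a, b and c; b and c both cite a.
edges = {("x", "a"), ("x", "b"), ("x", "c"), ("b", "a"), ("c", "a"), ("d", "e")}
precision = evaluate_node(edges, "x", keep="a", score=cn_score)
```

Averaging this value over all nodes in $V_{test}$ would give one precision figure per degree $k$.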


<figure style="display: inline-block; text-align: center; max-width: 100%; margin: 0 20px">
<img src="Pictures/common neighbors degree comparison.png" style="max-width: 85%; margin-bottom: 10px">
<figcaption style="font-style: italic">Clustering coefficient as a function of node degree. Indicates that two nodes sharing a high degree neighbor are less likely to share a connection than two nodes sharing a low degree neighbor.</figcaption>
</figure>

### Results




<figure style="display: inline-block; text-align: center; max-width: 100%; margin: 0 20px">
<img src="Pictures/per_node_validation.png" style="max-width: 85%; margin-bottom: 10px">
<figcaption style="font-style: italic">Clustering coefficient as a function of node degree. Indicates that two nodes sharing a high degree neighbor are less likely to share a connection than two nodes sharing a low degree neighbor.</figcaption>
</figure>


<figure style="display: inline-block; text-align: center; max-width: 100%; margin: 0 20px">
<img src="Pictures/weight comparison.png" style="max-width: 85%; margin-bottom: 10px">
<figcaption style="font-style: italic">Clustering coefficient as a function of node degree. Indicates that two nodes sharing a high degree neighbor are less likely to share a connection than two nodes sharing a low degree neighbor.</figcaption>
</figure>

#### Implementation

Implementing triadic closeness scoring requires three parts:

* A method that takes three nodes and returns the triad ID
* A method that returns the distribution of triads for an entire network
* A method that calculates the actual score for a pair of nodes, $x$ and $y$

Finding the triad ID is done through the fairly complicated nest of `if` statements below. Note that Python has no `switch` statement, which would otherwise be the obvious thing to use here.
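The original code is not reproduced here, so the following is only a sketch of what such a classifier might look like. The offsets 0, 10, 20 and 30 follow the labelling described earlier, but the numbering within 1-9 is hypothetical and is not guaranteed to match the figure. The network is assumed to be a set of `(source, target)` tuples.

```python
def link_state(edges, a, b):
    """0: no link, 1: a -> b only, 2: b -> a only, 3: reciprocal."""
    forward, backward = (a, b) in edges, (b, a) in edges
    if forward and backward:
        return 3
    if forward:
        return 1
    if backward:
        return 2
    return 0


def triad_id(edges, x, y, z):
    """Triad ID for (x, y) with common neighbor z, or None if z is not one."""
    state_xz = link_state(edges, x, z)
    state_yz = link_state(edges, y, z)
    if state_xz == 0 or state_yz == 0:
        return None
    base = 3 * (state_xz - 1) + state_yz  # hypothetical numbering 1..9
    return base + 10 * link_state(edges, x, y)  # +0, +10, +20 or +30
```

For example, with edges `{("x", "z"), ("y", "z")}` the open triad gets ID 1 in this numbering, and adding the link `("x", "y")` moves it to ID 11.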

Returning the triad distribution for the entire network can now be done easily with a nested loop as below. Note that the way Schall set up the different categories of triads means that some information will be duplicated, i.e. every time there is an instance of node $x$ connecting to node $y$ through $z$ the opposite will also be true. `networkx` implements a different method called `nx.triadic_census` which does not do this. Calculating the distribution is the most computationally demanding part of this method, but luckily the results can be stored so it only needs to be computed once for each graph.

Finally the scoring is implemented as a method that takes a candidate edge as an argument and returns that edge with the TC score as an attribute.
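The two remaining parts, the triad distribution and the score, could be sketched as below. `classify` is assumed to be any callable that maps `(edges, x, y, z)` to a triad ID in the 1-39 labelling, and the candidate pair is assumed to be unlinked; the function names and the edge-set representation are illustrative and not the original implementation.

```python
from collections import Counter
from itertools import permutations


def triad_distribution(edges, classify):
    """Count triad IDs over every node z and every ordered pair of its neighbors."""
    nbrs = {}
    for u, v in edges:
        nbrs.setdefault(u, set()).add(v)
        nbrs.setdefault(v, set()).add(u)
    freq = Counter()
    for z, around in nbrs.items():
        # Both (x, y) and (y, x) are counted, so information is duplicated.
        for x, y in permutations(around, 2):
            freq[classify(edges, x, y, z)] += 1
    return freq


def tc_score(edges, x, y, freq, classify):
    """Triadic closeness of an unlinked candidate pair (x, y)."""
    nbrs_x = {b for a, b in edges if a == x} | {a for a, b in edges if b == x}
    nbrs_y = {b for a, b in edges if a == y} | {a for a, b in edges if b == y}
    score = 0.0
    for z in (nbrs_x & nbrs_y) - {x, y}:
        open_id = classify(edges, x, y, z)  # 1..9 while x and y are unlinked
        if freq.get(open_id, 0) == 0:
            continue  # triad type never observed, so no ratio can be formed
        closed = freq.get(open_id + 10, 0) + freq.get(open_id + 30, 0)
        score += closed / freq[open_id]
    return score
```

Because the ratio is unchanged by normalisation, raw counts are used for the frequencies here instead of relative frequencies.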

These methods together can now be combined with a k-fold validation scheme that works on training and validation sets. If you do this it is important to recalculate the triad distribution at each fold, as changing the training set also means changing the graph.



#### Potential theory

#### Motif analysis

### A common neighbours formulation for the temporal DAG
The directed, acyclic nature can be helpful in establishing a directed version of the undirected common neighborhood of two nodes. For an undirected network like the Facebook friend network, the number of friends in common can be a good predictor for a missing or future link, but for a directed and dynamic network like Wikipedia the direction of the links needs to be taken into account. If, for instance, two articles have many outgoing links in common, i.e. both reference many of the same articles, it does not necessarily mean that these two articles need to be linked. The same goes for the incoming links: if two articles are referenced by many of the same articles, it does not necessarily mean that there should be a link between the two. A more logical idea is that if two articles are in the out-neighborhood of many of the same articles, then they should probably be linked.

* Common neighbors in the out-neighborhood between $x$ and $y$ indicate that both articles reference the same articles
* Common neighbors in the in-neighborhood between $x$ and $y$ indicate that both articles are referenced by the same articles
* Having $y$ in the out-neighborhood of $z$ and $z$ in the out-neighborhood of $x$ indicates that $y$ is referenced by articles referenced by $x$
* Having $y$ in the in-neighborhood of $z$ and $z$ in the in-neighborhood of $x$ indicates that $y$ references articles that reference $x$

The additional constraint of nodes arriving in the network only having outgoing edges (since they are the newest verdict and thus cannot be referenced by others) leaves us with only two permutations to be considered, which can be seen in the figure below where the possible arrangements are A and B marked with a red border.


It is clear that simply collapsing these two cases into one and counting the common neighbors is the same as looking at the undirected graph.
Denoting the in-neighborhood of the node $v$ by $\Gamma_{in}(v)$ and the out-neighborhood by $\Gamma_{out}(v)$, and assuming that $v$ is a new arrival to the network, the following holds

$$
\begin{aligned}
\Gamma(v) &= \Gamma_{out}(v) \\
\Gamma(v) \cap \Gamma(u) &= (\Gamma_{out}(v) \cap \Gamma_{in}(u)) \cup (\Gamma_{out}(v) \cap \Gamma_{out}(u))
\end{aligned}
$$

This opens up the possibility of weighting each component of the neighborhood differently or of using different structural methods on each, e.g. Jaccard similarity on incoming edges and Resource Allocation on outgoing edges.
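The decomposition can be illustrated with a short sketch that returns the two components separately and combines them with hypothetical weights `w_in` and `w_out`. The weights and function names are assumptions for the example and no particular values are suggested by the text.

```python
def split_common_neighbors(edges, v, u):
    """Common neighbors of a new node v and an older node u, split by u's link direction."""
    out_v = {b for a, b in edges if a == v}
    in_u = {a for a, b in edges if b == u}
    out_u = {b for a, b in edges if a == u}
    return out_v & in_u, out_v & out_u


def combined_score(edges, v, u, w_in=1.0, w_out=1.0):
    """Weighted count of the two components; equal weights give the undirected count."""
    via_in, via_out = split_common_neighbors(edges, v, u)
    return w_in * len(via_in) + w_out * len(via_out)


# Toy example: v cites a and b; a cites u, and u cites b and c.
edges = {("v", "a"), ("v", "b"), ("a", "u"), ("u", "b"), ("u", "c")}
via_in, via_out = split_common_neighbors(edges, "v", "u")
```

In an acyclic network the two components cannot overlap, so with both weights set to 1 the combined score equals the plain common neighbors count.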