# Table of Contents
* &nbsp;
	* &nbsp;
		* [Network characteristics](#Network-characteristics)


### Network characteristics

The network of verdicts has the following characteristics:

1. It is directed, since each link is a reference from one verdict to another
2. It has a simple temporal structure where each verdict can only reference back in time
3. Once established, a node is static and cannot change when new nodes are added to the network
4. It is acyclic, since two verdicts referencing each other would contradict the statements in points 1 and 2

If we assume that the likelihood of a link being made is independent of the distance in time, the only thing needed to take the temporal structure into account is pruning the network to the timestamp of the link being validated. This leaves a directed acyclic graph (DAG) in which the newest node is guaranteed to have only outgoing links.
<figure style="display: inline-block; float: right; text-align: center; max-width: 50%; margin: 0 20px">
<img src="Pictures/DAG.png" style="float: right; max-width: 100%; margin-bottom: 10px">
<figcaption style="font-style: italic">Possible triadic configurations for a directed, acyclic graph where the link between $x$ and $y$ is the possible link being investigated. A and B are the only possibilities if $x$ cannot have incoming links.</figcaption>
</figure>
Any three nodes in a general DAG can be configured in one of four ways, and the additional constraint that the newest node has only outgoing links reduces the possible triadic patterns to two, as seen in the figure on the right. A more general directed network that allows reciprocal links between nodes would have a total of 36 different triadic configurations.
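
The pruning step described above can be sketched as follows. This is only a minimal illustration: it assumes that every verdict carries a comparable timestamp and that the network is stored as a plain list of `(citing, cited)` pairs. The function name `prune_to_timestamp` and the toy data are hypothetical.

```python
def prune_to_timestamp(edges, timestamps, newest):
    """Keep only nodes that are no newer than `newest` and the links among them."""
    cutoff = timestamps[newest]
    keep = {v for v, t in timestamps.items() if t <= cutoff}
    return [(u, v) for u, v in edges if u in keep and v in keep]


# Toy citation network (assumed data): an edge (u, v) means "u cites v".
timestamps = {"a": 1, "b": 2, "c": 3, "d": 4}
edges = [("b", "a"), ("c", "a"), ("c", "b"), ("d", "c"), ("d", "a")]

pruned = prune_to_timestamp(edges, timestamps, newest="c")

# After pruning, the newest node should have no incoming links left.
in_links_of_newest = [e for e in pruned if e[1] == "c"]
```

In this toy network the verdict `d` and its two citations disappear, so `c` becomes the newest node and only has outgoing links.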

### Link prediction methods in directed networks
Most methods presented for link prediction have focused on undirected networks, where the predictor is some variant of the common neighbours formulation with some weighting scheme. These have proved to be efficient in many real networks and even perform well in directed networks when the directedness is discarded [](#cite-lu). **Show the entire unsupervised framework here?**
Unsupervised methods for directed networks also exist. In general they try to exploit the richer structural information found in directed networks to make better predictions, either by defining specific in- and out-networks of interest or by looking at the likelihood of the triadic configuration that a predicted link would create.

### Common Neighbors

Common neighbors, or the intersection of the 1-neighborhoods of the nodes $x$ and $y$, is the most commonly used neighborhood in link prediction. It has been shown that the likelihood of a link between two nodes correlates highly with the distance between the nodes as well as with the number of neighbors they share (Link prediction, fair evaluation) (Lu, survey).
Given that 

### Triadic closeness
In a network, a triad is a combination of three nodes, $x$, $z$ and $y$, along with the possible links between them. Every combination is a specific type of triad. For a directed graph there are 16 different types, but this increases to 36 when we consider that $z$ is a common neighbor of $x$ and $y$ and that the order matters for link prediction: the triad consisting of the links $\{(x,y), \ (x,z), \ (y,z)\}$ is not the same as the triad consisting of $\{(y,x), \ (x,z), \ (y,z)\}$. The different combinations are shown in the figures, where they are classified by the link between $x$ and $y$: 1-9 means no link, 11-19 a link from $x$ to $y$, 21-29 a link from $y$ to $x$, and 31-39 a reciprocal link between $x$ and $y$.

<figure style="display: inline-block; text-align: center; max-width: 100%; margin: 0 auto">
<img src="Pictures/open triads.png" style="margin: 0 auto; max-width: 85%; margin-bottom: 10px">
<figcaption style="font-style: italic">Possible open triadic configurations for a directed graph, with the category label below the triad.</figcaption>
</figure>
<figure style="display: inline-block; float: right; text-align: center; max-width: 45%; margin: 0 20px">
<img src="Pictures/closed triads.png" style="float: right; max-width: 100%; margin-bottom: 10px">
<figcaption style="font-style: italic">Possible closed triadic configurations for a directed graph, with the category label below the triad.</figcaption>
</figure>
Triadic closeness is an unsupervised method proposed by Schall [](#cite-schallTC) that exploits the fact that some triadic patterns are more likely to appear in a given network than others. If a possible link between the nodes $x$ and $y$ is being considered and the two nodes have the common neighbor $z$, the likelihood of the resulting triad appearing in the network can be used as a score. In other words, the likelihood of a link between $x$ and $y$ depends on how common the triad formed by adding the link is compared to how common the triad formed by the unconnected $x$ and $y$ is.

This is related to [motif analysis](https://en.wikipedia.org/wiki/Network_motif), specifically motif analysis where every motif is a sub-graph of size 3. Schall [](#cite-schallTC) produced good results with the method on data from Twitter, GitHub and Google+.

The actual triadic closeness score is calculated as

$$
s_{x,y} = \sum_{z \ \in \ \Gamma(x) \cap \Gamma(y)} \frac{\text{F}(\delta(x,y,z) + 10) + \text{F}(\delta(x,y,z) + 30)}{\text{F}(\delta(x,y,z))}
$$

where $s_{x,y}$ is the score for the node pair $x,y$, $\Gamma(x)$ returns the neighborhood of $x$, $\delta(x,y,z)$ returns the ID of the triad formed by the nodes $x$, $y$ and $z$, and $\text{F}(t)$ returns the frequency of the triad $t$. Note that the neighborhood includes both incoming and outgoing links.

Since a triad requires that the nodes $x$ and $y$ are joined through an extra node $z$, nodes that are linked but do not share neighbors will not be found by this method. If the network being examined has important structure beyond the 1-neighborhood, more general motif analysis methods can be used.

### Common referrers
<figure style="display: inline-block; float: right; text-align: center; max-width: 33%; margin: 0 20px">
<img src="Pictures/Common referrers.png" style="float: right; max-width: 100%; margin-bottom: 10px">
<figcaption style="font-style: italic">Common referrer neighborhood for a possible connection between $x$ and $y$ where the common referrers are the $z$-nodes marked in red.</figcaption>
</figure>
Common referrers is an extension of the common neighbors metric. It assumes that if $x$ is connected to $z$ through $u$, then $x$ is also more likely to connect to other nodes that $z$ is connected to. For the verdict citation network specifically, this indicates that verdicts that have the same citations are similar and that similar verdicts tend to have the same citations.
If a possible link between the node pair $(x,y)$ is being evaluated, the common referrers index takes the following form:

$$
s_{x,y} = \bigcup_{u \ \in \ \Gamma_{out}(x)} \left( \Gamma_{in}(y) \cap \Gamma_{in}(u) \right)
$$

The neighborhood is also shown in figure 4 where the common referrers are the $z$-nodes marked in red for a potential connection between $x$ and $y$.

### Preferential attachment
Preferential attachment is related to the growth mechanism of Barabási-Albert networks, where the probability of a newly inserted node, $x$, connecting to another node, $y$, is proportional to the degree of $y$.


## Weighting schemes

The scores from the presented neighborhoods can be used as is: for common neighbors and common referrers the score is simply the number of nodes in the returned neighborhood, and for triadic closeness it is the ratio of potential triad counts to the count of the current triad.

In many cases, however, it has proven useful to weigh the scores by some function, for instance dividing the size of the returned neighborhood by the size of the potential returned neighborhood (Jaccard) or weighing the returned nodes by the inverse of their degree (RA). In general these weighting schemes can be applied to any of the returned neighborhoods. They involve either multiplying the cardinality of the set by some factor or replacing the raw node count with a weighted sum. What they have in common is that they are all based on structural features, so no meta information about the nodes is needed.

The different weighting schemes are presented below where $\Gamma$ is the neighborhood function and $k_{in, z}$ is the in-degree of node $z$.

#### Jaccard
The Jaccard index weighs the number of common neighbors of $x$ and $y$ by their total number of potential common neighbors, making connections between high degree nodes less likely than connections between low degree nodes (Survey paper).
$s_{x,y} = \frac{|\Gamma(x) \cap \Gamma(y)|}{|\Gamma(x) \cup \Gamma(y)|}$

#### Adamic / Adar

Adamic/Adar is based on the idea that shared neighbors with a high degree are less significant than shared neighbors with a low degree (Friends and neighbors on the web).
$s_{x,y} = \sum_{z \ \in \ \Gamma(x) \cap \Gamma(y)} \frac{1}{\log k_z}$

#### Resource Allocation

Resource allocation is similar to Adamic/Adar, except that it does not take the logarithm of the degree of the neighbor node and in that way penalizes high degree nodes more. It is based on the concept of information flow in a network, where the probability of a signal reaching node $y$ from node $x$ through $z$ is inversely proportional to the degree of $z$ (Lu survey source 51).

$s_{x,y} = \sum_{ z \ \in \ \Gamma(x) \cap \Gamma(y)} \frac{1}{k_z}$

#### Leicht-Holme-Newman

Leicht, Holme and Newman proposed this index in (Lu survey source 38) as a method for scoring node pairs that have a high number of common neighbours compared to their expected number of common neighbours, which is proportional to the product of $k_x$ and $k_y$.

$s_{x,y} = \frac{|\Gamma(x) \cap \Gamma(y)|}{k_x \times k_y}$

#### Hub Depressed Similarity

$s_{x,y} = \frac{|\Gamma(x) \cap \Gamma(y)|}{\max(k_x, k_y)}$

#### Hub Promoted similarity

$s_{x,y} = \frac{|\Gamma(x) \cap \Gamma(y)|}{\min(k_x, k_y)}$
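
To compare the weighting schemes of this section side by side, the following sketch computes all of them for a single node pair. It assumes the simplest setting, where $\Gamma$ is the undirected neighborhood stored as a dictionary of sets and $k_z$ is the size of that neighborhood. The function name `similarity_scores` and the toy graph are illustrative and not part of the original analysis.

```python
import math


def similarity_scores(nbrs, x, y):
    """Common-neighbor based indices for the pair (x, y) on undirected neighborhoods."""
    common = nbrs[x] & nbrs[y]
    union = nbrs[x] | nbrs[y]
    kx, ky = len(nbrs[x]), len(nbrs[y])
    return {
        "common_neighbors": len(common),
        "jaccard": len(common) / len(union) if union else 0.0,
        # A common neighbor is linked to both x and y, so its degree is at least 2
        # and the logarithm is strictly positive.
        "adamic_adar": sum(1 / math.log(len(nbrs[z])) for z in common),
        "resource_allocation": sum(1 / len(nbrs[z]) for z in common),
        "leicht_holme_newman": len(common) / (kx * ky) if kx and ky else 0.0,
        "hub_depressed": len(common) / max(kx, ky) if max(kx, ky) else 0.0,
        "hub_promoted": len(common) / min(kx, ky) if min(kx, ky) else 0.0,
    }


# Toy undirected graph (assumed data); every link is listed in both directions.
nbrs = {
    "x": {"a", "b", "c"},
    "y": {"a", "b", "d"},
    "a": {"x", "y"},
    "b": {"x", "y", "c"},
    "c": {"x", "b"},
    "d": {"y"},
}
scores = similarity_scores(nbrs, "x", "y")
```

The two common neighbors `a` and `b` have degrees 2 and 3, so resource allocation gives $1/2 + 1/3$ while Adamic/Adar gives $1/\log 2 + 1/\log 3$, which shows how the logarithm softens the penalty on the higher degree neighbor.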

### Evaluation methods

Link prediction is a very imbalanced classification problem, meaning that there are far more node pairs that do not share a link than pairs that do. This can lead to hard to interpret or misleading results for the classification methods (Link prediction, fair evaluation), especially when using the traditional area under the receiver operating characteristic (ROC) curve for evaluating links. A better alternative is the precision-recall curve (PRC), which has been shown to be equivalent to ROC in the sense that a curve dominates in ROC space if and only if it dominates in precision-recall space (The relationship between precision-recall and ROC).

Another issue is that the number of potential edges is $V^2$, where $V$ is the number of vertices in the network, and the ratio of existing edges to non-existing edges, $\frac{E}{V^2-E}$, is usually very small, resulting in very high computation times. Undersampling the non-existing edges leads to erroneous precision results, so instead the existing edges are undersampled and scores are only calculated for non-existing edges that will give a score higher than 0. This lowers the computation time without distorting the method's precision, while still following the guidelines set out by Lichtenwalter and Chawla (Link prediction, fair evaluation) for evaluation of link prediction methods.

For the network of verdicts a reasonable application is a recommender system, so an evaluation method that mimics this use case is preferable. We choose a node centric approach where nodes of a specific degree, $k$, are chosen to create a test set, $V_{test}$. For each node in the test set we remove all but one of its edges and attempt to predict the $k-1$ removed edges. Precision and recall are then calculated based on the results of these predictions.
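
The node centric procedure can be sketched for a single test node as follows. Several details are assumptions rather than part of the described setup: the edge that is kept is chosen at random, the predictions are the `top_n` highest scoring candidates, and the scorer `in_degree_score` is only a placeholder for any of the methods presented earlier.

```python
import random


def in_degree_score(graph, x, y):
    """Placeholder scorer (assumption): number of nodes that already link to y."""
    return sum(1 for targets in graph.values() if y in targets)


def evaluate_node(out_nbrs, node, score, top_n, rng):
    """Hide all but one out-link of `node`, then try to recover the hidden links."""
    targets = sorted(out_nbrs[node])
    kept = rng.choice(targets)
    hidden = set(targets) - {kept}

    # Training graph: identical to the input except for the hidden links.
    train = {u: set(vs) for u, vs in out_nbrs.items()}
    train[node] = {kept}

    # Rank every node that `node` does not link to in the training graph.
    all_nodes = set(train) | {v for vs in train.values() for v in vs}
    candidates = all_nodes - {node, kept}
    ranked = sorted(candidates, key=lambda y: (-score(train, node, y), str(y)))
    predicted = set(ranked[:top_n])

    hits = len(predicted & hidden)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(hidden) if hidden else 0.0
    return precision, recall


# Toy citation data (assumed): the test node "t" has degree k = 3.
out_nbrs = {"t": {"a", "b", "c"}, "p": {"a", "b", "c"}, "q": {"a", "b"}, "r": {"a"}}
precision, recall = evaluate_node(out_nbrs, "t", in_degree_score, top_n=2,
                                  rng=random.Random(0))
```

Averaging the per-node precision and recall over all nodes in $V_{test}$, and repeating this for different values of `top_n`, would trace out a precision-recall curve for the chosen scorer.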

#### Results



#### Implementation

Implementing triadic closeness scoring requires three parts

* A method that takes three nodes and returns the triad ID
* A method that returns the distribution of triads for an entire network
* A method that calculates the actual score for a pair of nodes, $x$ and $y$

Finding the triad ID is done through the fairly complicated nest of `if` statements below. Note that Python versions before 3.10 do not have a `switch`-like statement (the `match` statement was added in 3.10), otherwise this would be the obvious thing to use here.

Returning the triad distribution for the entire network can now be done easily with a nested loop as below. Note that the way Schall set up the different categories of triads means that some information will be duplicated, i.e. every time there is an instance of node $x$ connecting to node $y$ through $z$ the opposite will also be true. `networkx` implements a different method called `nx.triadic_census` which does not do this. Calculating the distribution is the most computationally demanding part of this method, but luckily the results can be stored so it only needs to be computed once for each graph.

Finally the scoring is implemented as a method that takes a candidate edge as an argument and returns that edge with the TC score as an attribute.
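
The three parts described above can be sketched together as follows. This is not the original implementation: the graph is assumed to be a dictionary mapping each node to the set of nodes it links to, and the numbering of the nine base triads (`TIE_INDEX`) is an arbitrary assumption, since the actual numbering is defined by the figure of triad categories. Only the offsets of 10, 20 and 30 for the link between $x$ and $y$ follow the classification in the text.

```python
from collections import Counter

# Assumed numbering of the ties x-z and y-z; the real IDs are fixed by the figure.
TIE_INDEX = {"out": 0, "in": 1, "both": 2}
LINK_OFFSET = {None: 0, "out": 10, "in": 20, "both": 30}


def tie(succ, a, b):
    """Link between a and b as seen from a: 'out', 'in', 'both' or None."""
    forward = b in succ.get(a, ())
    backward = a in succ.get(b, ())
    if forward and backward:
        return "both"
    if forward:
        return "out"
    if backward:
        return "in"
    return None


def triad_id(succ, x, y, z):
    """Triad ID for (x, y, z), or None when z is not a common neighbor."""
    xz, yz = tie(succ, x, z), tie(succ, y, z)
    if xz is None or yz is None:
        return None
    base = 1 + 3 * TIE_INDEX[xz] + TIE_INDEX[yz]  # 1..9
    return base + LINK_OFFSET[tie(succ, x, y)]    # +10, +20 or +30 if x, y are linked


def undirected_neighbors(succ):
    """Neighborhoods that ignore direction (in- and out-links together)."""
    nbrs = {}
    for u, targets in succ.items():
        for v in targets:
            nbrs.setdefault(u, set()).add(v)
            nbrs.setdefault(v, set()).add(u)
    return nbrs


def triad_distribution(succ, nbrs):
    """Count triad IDs over all ordered pairs (x, y) and their common neighbors z."""
    freq = Counter()
    for x in nbrs:
        for y in nbrs:
            if x == y:
                continue
            for z in (nbrs[x] & nbrs[y]) - {x, y}:
                freq[triad_id(succ, x, y, z)] += 1
    return freq


def tc_edge(succ, nbrs, freq, x, y):
    """Return the candidate edge (x, y) with its TC score as an attribute."""
    score = 0.0
    for z in (nbrs.get(x, set()) & nbrs.get(y, set())) - {x, y}:
        open_id = triad_id(succ, x, y, z) % 10  # ID of the triad without an x-y link
        if freq[open_id]:
            score += (freq[open_id + 10] + freq[open_id + 30]) / freq[open_id]
    return x, y, {"tc": score}


# Toy graph (assumed data): succ[u] is the set of nodes that u links to.
succ = {"x": {"z"}, "y": {"z"}, "a": {"z", "b"}, "b": {"z"}}
nbrs = undirected_neighbors(succ)
freq = triad_distribution(succ, nbrs)
edge = tc_edge(succ, nbrs, freq, "x", "y")
```

Because the distribution only depends on the graph, `freq` can be computed once and reused for every candidate edge, which is where most of the computation time is saved.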

These methods together can now be combined with a k-fold validation scheme that works on training and validation sets. If you do this it is important to recalculate the triad distribution at each fold, as changing the training set also means changing the graph.



#### Potential theory

#### Motif analysis

### A common neighbours formulation for the temporal DAG
The directed, acyclic nature can be helpful in establishing a directed version of the undirected common neighborhood of two nodes. For an undirected network like the Facebook friend network, the number of friends in common can be a good predictor of a missing or future link, but for a directed and dynamic network like Wikipedia the direction of the links needs to be taken into account. If, for instance, two articles have many outgoing links in common, i.e. both reference many of the same articles, it does not necessarily mean that these two articles need to be linked. The same goes for the incoming links: if two articles are referenced by many of the same articles, it does not necessarily mean that there should be a link between the two. A more logical idea is that if two articles are in the out-neighborhood of many of the same articles, then they should probably be linked.

* Common neighbors in the out-neighborhoods of $x$ and $y$ indicate that both articles reference the same articles
* Common neighbors in the in-neighborhoods of $x$ and $y$ indicate that both articles are referenced by the same articles
* Having $y$ in the out-neighborhood of $z$ and $z$ in the out-neighborhood of $x$ indicates that $y$ is referenced by articles referenced by $x$
* Having $y$ in the in-neighborhood of $z$ and $z$ in the in-neighborhood of $x$ indicates that $y$ references articles that reference $x$

The additional constraint of nodes arriving in the network only having outgoing edges (since they are the newest verdict and thus cannot be referenced by others) leaves us with only two permutations to be considered, which can be seen in the figure below where the possible arrangements are A and B marked with a red border.


It is clear that simply collapsing these two cases into one and counting the common neighbors is the same as looking at the undirected graph.
Denoting the in-neighborhood of the node $v$ by $\Gamma_{in}(v)$ and the out-neighborhood by $\Gamma_{out}(v)$, and assuming that $v$ is a new arrival to the network, the following holds:

$$
\begin{aligned}
\Gamma(v) &= \Gamma_{out}(v) \\
\Gamma(v) \cap \Gamma(u) &= (\Gamma_{out}(v) \cap \Gamma_{in}(u)) \cup (\Gamma_{out}(v) \cap \Gamma_{out}(u))
\end{aligned}
$$

This opens up the possibility of weighting each component of the neighborhood differently or using different structural methods on each, e.g. Jaccard similarity on incoming edges and Resource Allocation on outgoing edges.
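
As a rough illustration of this idea, the sketch below splits the common neighborhood of a new arrival $v$ and a candidate $u$ into its two components and scores them separately. The weights `alpha` and `beta`, the choice of total degree for $k_z$ and all function names are assumptions made for the example only.

```python
def invert(out_nbrs):
    """in_nbrs[v] holds the nodes that link to v."""
    in_nbrs = {v: set() for v in out_nbrs}
    for u, targets in out_nbrs.items():
        for t in targets:
            in_nbrs.setdefault(t, set()).add(u)
    return in_nbrs


def split_common_neighbors(out_nbrs, in_nbrs, v, u):
    """Two components of the common neighborhood when v only has outgoing links."""
    via_in = out_nbrs[v] & in_nbrs.get(u, set())    # v cites z and z cites u
    via_out = out_nbrs[v] & out_nbrs.get(u, set())  # v and u both cite z
    return via_in, via_out


def combined_score(out_nbrs, in_nbrs, v, u, alpha=1.0, beta=1.0):
    """Jaccard on the incoming component plus Resource Allocation on the outgoing one."""
    via_in, via_out = split_common_neighbors(out_nbrs, in_nbrs, v, u)
    union_in = out_nbrs[v] | in_nbrs.get(u, set())
    jaccard_in = len(via_in) / len(union_in) if union_in else 0.0
    # Assumption: k_z is the total degree (in plus out) of the shared neighbor z.
    ra_out = sum(
        1 / (len(in_nbrs.get(z, set())) + len(out_nbrs.get(z, set())))
        for z in via_out
    )
    return alpha * jaccard_in + beta * ra_out


# Toy data (assumed): v is the new arrival and has only outgoing links.
out_nbrs = {"v": {"a", "b"}, "u": {"b", "c"}, "a": {"u"}, "b": set(), "c": set()}
in_nbrs = invert(out_nbrs)
via_in, via_out = split_common_neighbors(out_nbrs, in_nbrs, "v", "u")
undirected_common = out_nbrs["v"] & (in_nbrs["u"] | out_nbrs["u"])
score = combined_score(out_nbrs, in_nbrs, "v", "u")
```

In the toy data the union of the two components equals the undirected common neighborhood, as the identity above requires, while the separate scores make it possible to tune `alpha` and `beta` independently.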