# Networks and their Structure Assignment

## Network Science Topic 3

Note that the networks in this exercise are all undirected.

Here is a definition of a further kind of centrality known as *assignment centrality*.  It is similar to closeness centrality in that it depends on distances, but the definition is slightly different.  Again, let $d_{ij}$ be the distance from $i$ to $j$.  Then the assignment centrality of $j$ is

$$ \sum_{i \neq j} \frac{1}{d_{ij}} $$

where $1/d_{ij}$ is 0 if there is no path from $i$ to $j$.
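As a minimal sketch of the definition above, the distances $d_{ij}$ in an undirected, unweighted network can be found with a breadth-first search from $j$, and unreachable nodes are simply never visited, so they contribute 0. The function names and the small example network here are illustrative assumptions, not part of the exercise.

```python
from collections import deque


def bfs_distances(adj, source):
    """Return a dict mapping each node reachable from source to its distance."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def assignment_centrality(adj, j):
    """Sum of 1/d_ij over all i != j; unreachable nodes contribute 0."""
    dist = bfs_distances(adj, j)
    return sum(1 / d for node, d in dist.items() if node != j)


# Hypothetical example: a path a - b - c plus an isolated node d.
path = {"a": ["b"], "b": ["a", "c"], "c": ["b"], "d": []}
scores = {v: assignment_centrality(path, v) for v in path}
```

Here `b` scores $1 + 1 = 2$, the two end nodes score $1 + \frac{1}{2} = 1.5$, and the isolated node scores 0.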

1. [3 marks] Describe one advantage of using assignment centrality in preference to closeness centrality.

**The advantage is that assignment centrality produces numbers that are easier to work with than those of closeness centrality, especially for larger networks. This follows from the two definitions. Closeness centrality puts the sum of distances in the denominator under a constant numerator of 1, so the resulting values are small and close to zero, which makes them hard to interpret. In contrast, assignment centrality adds the fractions together, so the result is larger and a node with a higher value is more central, which is more meaningful.**


The Medici family were, through success in commerce and banking, wealthy and politically powerful in Florence beginning in the 13th century.  It has been suggested that their prominence can be explained by considering the network, displayed below, of Florentine families and their links by marriage.

<img src="medici.jpg" width="400">

2. [5 marks] Calculate the assignment centrality of each node in the network and comment on whether the results support the theory that the Medici's power was a result of their position in this network.


Peruzzi: $(\frac{1}{1} \times 3) + (\frac{1}{2} \times 3) + (\frac{1}{3} \times 4) + (\frac{1}{4} \times 3) + (\frac{1}{5} \times 1) + 0 = 6.783$  
Bischeri: $(\frac{1}{1} \times 3) + (\frac{1}{2} \times 5) + (\frac{1}{3} \times 3) + (\frac{1}{4} \times 2) + (\frac{1}{5} \times 1) + 0 = 7.200$  
Castellani: $(\frac{1}{1} \times 3) + (\frac{1}{2} \times 3) + (\frac{1}{3} \times 5) + (\frac{1}{4} \times 3) + 0 = 6.917$  
Lamberteschi: $(\frac{1}{1} \times 1) + (\frac{1}{2} \times 3) + (\frac{1}{3} \times 5) + (\frac{1}{4} \times 4) + (\frac{1}{5} \times 1) + 0 = 5.367$  
Strozzi: 

The next two questions require an implementation of Newman's agglomerative algorithm for community detection (see the hints at the end of these questions).  We met it in `topic3b.pdf`: it is described on Slide 4.  Recall that the algorithm finds many community decompositions.  It starts with the decomposition in which every node alone is a community, and then merges communities until it reaches the decomposition in which every node is in the same community.  Thus the output of the algorithm contains a list of many decompositions.  The algorithm also calculates the change in the modularity $Q$ at each step, so the relative value of $Q$ for each decomposition is part of the output.  For example, here is the output for the example on Slide 7.

$$ \begin{array}{ll}
\{1\}, \{2\}, \{3\}, \{4\}, \{5\}, \{6\}, \{7\} & Q=0\\
\{1, 2\}, \{3\}, \{4\}, \{5\}, \{6\}, \{7\} & Q=0.086 \\
\{1, 2, 3\}, \{4\}, \{5\}, \{6\}, \{7\} & Q=0.210 \\
\{1, 2, 3\}, \{4\}, \{5\}, \{6, 7\} & Q=0.296 \\
\{1, 2, 3\}, \{4\}, \{5, 6, 7\} & Q=0.420 \\
\{1, 2, 3, 4\}, \{5, 6, 7\} & Q=0.432\\
\{1, 2, 3, 4, 5, 6, 7\} & Q=0.160
\end{array}
$$

(Note that this implies that $\{1, 2, 3, 4\}, \{5, 6, 7\}$ is the best community decomposition as it has the greatest modularity.  Note also that these are relative values as the modularity of the initial decomposition is not zero, but this is enough information for us to determine which decomposition is best.)
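To illustrate why these values are relative, the sketch below computes the absolute modularity of a decomposition, assuming the standard definition $Q = \sum_c (e_c - a_c^2)$, where $e_c$ is the fraction of edges inside community $c$ and $a_c$ is the fraction of edge endpoints in $c$. The function name `modularity` and the adjacency dictionary `example` are assumptions made for illustration. The neighbours of node 7 are taken to be 5 and 6, which is what the entries for nodes 5 and 6 in the test network imply.

```python
def modularity(adj, communities):
    """Absolute modularity Q = sum over communities of (e_c - a_c**2)."""
    two_m = sum(len(nbrs) for nbrs in adj.values())  # twice the number of edges
    q = 0.0
    for community in communities:
        members = set(community)
        # Each internal edge is seen from both endpoints, so this counts 2 * edges.
        internal_ends = sum(1 for u in members for w in adj[u] if w in members)
        degree_sum = sum(len(adj[u]) for u in members)
        q += internal_ends / two_m - (degree_sum / two_m) ** 2
    return q


# Assumed adjacency lists for the 7-node lecture network.
example = {1: [2, 3], 2: [1, 3], 3: [1, 2, 4, 5], 4: [3, 5],
           5: [3, 4, 6, 7], 6: [5, 7], 7: [5, 6]}

q_singletons = modularity(example, [[v] for v in example])
q_best = modularity(example, [[1, 2, 3, 4], [5, 6, 7]])
relative_best = q_best - q_singletons
```

Under these assumptions the all-singletons decomposition has $Q \approx -0.160$ rather than 0, and subtracting it from the absolute value for $\{1, 2, 3, 4\}, \{5, 6, 7\}$ recovers the relative value $0.432$ shown in the table.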

3. [8 marks] Construct the network defined in `zachary.txt` ignoring edge weights.  Run Newman's agglomerative algorithm on this network and write down the output in the same format used in the example above.  The file relates to the example mentioned in the lecture of a karate club that split in two (see the description in the file).  How do your results compare with what actually happened?

Living cells can be considered as complex webs of macromolecular interactions known as *interactome networks*.  Finding communities in these networks can aid understanding of how the cells function.

4. [9 marks] Construct the network defined in `CCSB-Y2H.txt`.  The first two items on each line are a pair of nodes joined by an edge (the rest of the line can be ignored).  The network is of interactions amongst proteins in yeast. Run Newman's agglomerative algorithm on this network.  The best community decomposition found should be written to a text file with the nodes of each community on a separate line.  Are there many other decompositions of a similar quality?  (See the link in the text file for further information on this dataset.)
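The input and output steps of Question 4 might be sketched as follows. This is only an outline under stated assumptions: items on a line are taken to be separated by whitespace, an edge from a node to itself is ignored, and the function names `read_edge_list` and `community_lines` as well as the node names `P1` to `P4` are hypothetical.

```python
def read_edge_list(lines):
    """Build an undirected adjacency dict from lines whose first two items are nodes."""
    adj = {}
    for line in lines:
        items = line.split()
        if len(items) < 2:
            continue  # skip blank or too-short lines
        u, w = items[0], items[1]
        adj.setdefault(u, set())
        adj.setdefault(w, set())
        if u != w:  # assumption: an edge from a node to itself is ignored
            adj[u].add(w)
            adj[w].add(u)
    return adj


def community_lines(communities):
    """Format each community as one line of space-separated node names."""
    return [" ".join(sorted(community)) for community in communities]


# Hypothetical lines standing in for the file contents.
sample = ["P1 P2 extra-field", "P2 P3", "P1 P2", "P4 P4"]
network = read_edge_list(sample)
lines_out = community_lines([{"P1", "P2", "P3"}, {"P4"}])
```

With the real data, an open file object can be passed directly, for example `with open("CCSB-Y2H.txt") as f: network = read_edge_list(f)`, although any header or description lines in the file would have to be skipped first.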

Let us describe an algorithm called *community finder*.  Rather than find a community decomposition, it just finds the community that a particular node $v$ belongs to.  The input to community finder is a network $N$, one of its nodes $v$ and a threshold value $a$ (a positive real number).  Let $d_i$ be the number of edges that join a node at distance $i$ from $v$ to a node at distance $i+1$ from $v$.  So $d_0$ is just the degree of $v$.  Let $\Delta_i=d_i/d_{i-1}$.  Community finder calculates $\Delta_1, \Delta_2, \ldots$ and stops at the first value $\ell$ such that $\Delta_\ell < a$.  Then the community of $v$ is simply all nodes at distance at most $\ell$ from $v$.
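The description above could be sketched as follows. This is one possible reading, not a model answer: the function name `community_finder` is hypothetical, an isolated node is assumed to form a community on its own (since $\Delta_1$ is then undefined), and `example` is an assumed adjacency dictionary for the 7-node lecture network.

```python
from collections import deque


def community_finder(adj, v, a):
    """Return the set of nodes within distance l of v, where l is the
    first index with Delta_l = d_l / d_(l-1) below the threshold a."""
    # Breadth-first search gives the distance of every reachable node from v.
    dist = {v: 0}
    queue = deque([v])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    max_dist = max(dist.values())
    # d[i] counts edges joining a node at distance i to a node at distance i + 1.
    d = [0] * (max_dist + 1)
    for u, du in dist.items():
        for w in adj[u]:
            if dist[w] == du + 1:
                d[du] += 1
    if d[0] == 0:
        return {v}  # assumption: an isolated node is its own community
    for ell in range(1, max_dist + 1):
        if d[ell] / d[ell - 1] < a:
            return {u for u, du in dist.items() if du <= ell}
    return set(dist)  # only reached if a is not positive


# Assumed adjacency lists for the 7-node lecture network.
example = {1: [2, 3], 2: [1, 3], 3: [1, 2, 4, 5], 4: [3, 5],
           5: [3, 4, 6, 7], 6: [5, 7], 7: [5, 6]}

tight = community_finder(example, 1, 1.5)
loose = community_finder(example, 1, 0.5)
```

From node 1 this network has $d_0 = d_1 = d_2 = 2$ and $d_3 = 0$, so $\Delta_1 = \Delta_2 = 1$ and $\Delta_3 = 0$. A threshold of 1.5 therefore stops at $\ell = 1$ and gives $\{1, 2, 3\}$, while a threshold of 0.5 stops only at $\ell = 3$ and returns every node.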

5. [3 marks] Does the definition of community suggest that the community finder algorithm is likely to succeed in finding communities?  In what circumstances might it perform well or badly?

6. [6 marks] Implement community finder, apply it to the network of `zachary.txt`, and comment on the results compared to your findings of Question 3.  You will need to run it for various choices of $v$ and $a$.

7. [6 marks] You could run community finder for every node in a network, but then you would have as many communities as there are nodes and they would overlap.  Suggest how community finder could be used as the basis of a community decomposition algorithm (that, as usual, partitions the nodes into disjoint sets) and test your idea on the network of `zachary.txt`.  

----
#### Newman's agglomerative algorithm for community detection
 
If you can find a library to provide the algorithm, you can use that although it might require you to define networks in a different way.  Here is an outline approach to the implementation.  You should first reread the description of the algorithm.

Define a dictionary called, say, `communities` that throughout the run of the algorithm will contain as keys the current communities (so the keys will keep changing as communities are merged).  The values in the dictionary will record the nodes in the community and the number of edge endpoints incident with those nodes (that is, the sum of their degrees).

Define a dictionary called, say, `pairs` that throughout the run of the algorithm will contain as keys each pair of current communities.  The values in the dictionary will record the number of edges between the pair.

Define a function called, say, `deltaQ` whose input is a pair of communities and whose output is the change in modularity if that pair of communities is merged. 

Create a priority queue that contains each pair of communities keyed by the change in modularity if that pair were merged.

Then the algorithm repeatedly picks the pair of communities from the priority queue whose merger will give the maximum change in modularity and merges that pair (and updates `communities` and `pairs` and pushes new pairs onto the priority queue).
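The outline above might be put together as in the following sketch. Several choices here are assumptions rather than part of the exercise: the change in modularity for merging communities $i$ and $j$ is taken to be $e_{ij}/m - k_i k_j / (2m^2)$, where $e_{ij}$ is the number of edges between them, $k_i$ and $k_j$ are their endpoint counts and $m$ is the number of edges; `pairs` is stored as a nested dictionary; the standard library module `heapq` serves as the priority queue; and only communities joined by at least one edge are ever merged, so on a disconnected network the loop ends with one community per component.

```python
import heapq


def newman_agglomerative(adj):
    """Greedy modularity merging; returns a list of (communities, relative Q)."""
    m = sum(len(nbrs) for nbrs in adj.values()) / 2  # number of edges
    ids = {v: c for c, v in enumerate(adj)}
    # communities: id -> (list of nodes, number of edge endpoints at those nodes)
    communities = {ids[v]: ([v], len(adj[v])) for v in adj}
    # pairs: id -> {adjacent community id -> number of edges between the two}
    pairs = {c: {} for c in communities}
    for v in adj:
        for w in adj[v]:
            pairs[ids[v]][ids[w]] = pairs[ids[v]].get(ids[w], 0) + 1

    def delta_q(c1, c2):
        """Change in modularity if communities c1 and c2 were merged."""
        degree_product = communities[c1][1] * communities[c2][1]
        return pairs[c1].get(c2, 0) / m - degree_product / (2 * m * m)

    def snapshot():
        return [sorted(nodes) for nodes, _ in communities.values()]

    # heapq is a min-heap, so store -deltaQ to pop the largest increase first.
    heap = [(-delta_q(c1, c2), c1, c2)
            for c1 in pairs for c2 in pairs[c1] if c1 < c2]
    heapq.heapify(heap)

    q = 0.0  # modularity relative to the all-singletons decomposition
    steps = [(snapshot(), q)]
    next_id = len(communities)
    while heap:
        neg_dq, c1, c2 = heapq.heappop(heap)
        if c1 not in communities or c2 not in communities:
            continue  # stale entry: one of the two was merged earlier
        nodes1, ends1 = communities.pop(c1)
        nodes2, ends2 = communities.pop(c2)
        new = next_id
        next_id += 1
        communities[new] = (nodes1 + nodes2, ends1 + ends2)
        # Edge counts from the merged community to every other community.
        merged = {}
        for old in (c1, c2):
            for x, count in pairs.pop(old).items():
                if x != c1 and x != c2:
                    merged[x] = merged.get(x, 0) + count
        pairs[new] = merged
        for x, count in merged.items():
            pairs[x].pop(c1, None)
            pairs[x].pop(c2, None)
            pairs[x][new] = count
            heapq.heappush(heap, (-delta_q(new, x), new, x))
        q -= neg_dq  # neg_dq holds -deltaQ
        steps.append((snapshot(), q))
    return steps


# Assumed adjacency lists for the 7-node lecture network.
example = {1: [2, 3], 2: [1, 3], 3: [1, 2, 4, 5], 4: [3, 5],
           5: [3, 4, 6, 7], 6: [5, 7], 7: [5, 6]}
steps = newman_agglomerative(example)
```

Giving every merged community a fresh id means that a queue entry is out of date exactly when one of its two ids no longer exists, so stale entries can be skipped when popped instead of being removed from the queue. On the lecture network this sketch yields the relative values 0, 0.086, 0.210, 0.296, 0.420, 0.432 and 0.160 given in the example output for Slide 7, although ties may be broken in a different order.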

For the priority queue you could use

```python
from queue import PriorityQueue
```

but note that this only allows queues from which you can pick the entry with the minimum value.  You can make such a queue perform as a *maximum* priority queue by multiplying all the keys by -1.
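A minimal illustration of the negation trick, using made-up modularity changes for three hypothetical pairs of communities:

```python
from queue import PriorityQueue

pq = PriorityQueue()

# Illustrative (change in modularity, pair of communities) entries.
candidates = [(0.062, (1, 3)), (0.086, (1, 2)), (0.012, (3, 5))]
for delta, pair in candidates:
    pq.put((-delta, pair))  # negate the key so the largest change comes out first

neg_delta, best_pair = pq.get()
best_delta = -neg_delta  # undo the negation to recover the actual change
```

The first entry retrieved is the pair `(1, 2)` with the largest change, 0.086.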

Below is the network from the lecture that you can use to test your implementation.

Note that, with a reasonable implementation, no computation required by this exercise should take more than a couple of minutes.

```python
# Adjacency lists of the undirected test network from the lecture.
test = {1: [2, 3],
        2: [1, 3],
        3: [1, 2, 4, 5],
        4: [3, 5],
        5: [3, 4, 6, 7],
        6: [5, 7],
        7: [5, 6]}
```