
### Complex Networks
### Programming Assignment 3
### Jason Thomas
### s3907634

---

First, load the data and create a network before commencing Question 1.

It seems simple enough to use just the one edge list here. The resulting network will have nodes that can be referred to by name.

In [1]:
namedEdgeList <- as.data.frame(read.table("names.dat.txt",
                                          sep = ",",
                                          skip=5));

`graph.data.frame` automatically treats the first two columns of a data frame as an edge list; any extra columns become edge attributes.
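
As a minimal sketch of that behaviour, the toy data frame below (three invented rows, not taken from `names.dat.txt`) is turned into a graph. It uses `graph_from_data_frame`, which to my knowledge is the newer name for the same function in recent igraph releases. The third column should come through as an edge attribute.

```r
library(igraph)

# Hypothetical three-row edge list; the names and weights are invented
toyEdges <- data.frame(n1 = c("a", "a", "b"),
                       n2 = c("b", "c", "c"),
                       weight = c(4, 1, 2),
                       stringsAsFactors = FALSE)

# The first two columns define the edges; 'weight' becomes an edge attribute
toyNet <- graph_from_data_frame(toyEdges, directed = FALSE)

print(vcount(toyNet))      # number of nodes
print(ecount(toyNet))      # number of edges
print(E(toyNet)$weight)    # the extra column, carried over as an edge attribute
```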

In [2]:
colnames(namedEdgeList) <- c('n1', 'n2', 'weight');

library("igraph");

namedNet <- igraph::graph.data.frame(namedEdgeList,
                                     directed = FALSE);



As a sanity test, this should match the size listed in the edgelist:

In [3]:
names = V(namedNet)$name;
print(length(names))

[1] 1773


### Question 1 - Discussion

If two names are connected in the text more than once (more than one mention), this is represented as a single edge between the two nodes with a larger weight. For example, if 4 mentions include a particular pair of names, the edge connecting the relevant nodes has weight 4. Later in this report, edge weights are described as representing "duplicated edges".

Then there are two ways to analyse centrality in this network:
- ignore edge weights
- consider edge weights

The assignment doesn't mention what to do with edge weights so this report takes some liberties.

Below is a table of centrality measures that this report will look at and the reason for including those.

| Centrality  | Method | Assignment asks for this | Reason |
| :---------- | :----- | :----- | :----- |
| Degree | igraph function | Yes | While this does not account for edge weights, this reduces the impact of repetitive writing. |
| Total edge weight | take sum of combined weight of all edges connected to a node | No | This is similar to degree centrality but will account for edge weights. The assumption is that duplicated edges are important. |
| Closeness | igraph function | Yes | Edge weights will not affect the distance between any two nodes, so we should use the igraph function as-is. Nodes at the network centre are important. |
| Eigenvector | igraph function | Yes | While this does not account for edge weights, this reduces the impact of repetitive writing. |
| Eigenvector x total edge weight | product of two measures | No | Edge weights potentially have an impact on the importance of a node based on its neighbours' importance. The assumption is that duplicated edges are important. |
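
To make the difference between degree and total edge weight concrete, here is a small base-R sketch on an invented three-node edge list (the names and weights are hypothetical). It is only an illustration of the two definitions, not the code used for the results. igraph's `strength()` function should report the same totals for a graph with a `weight` edge attribute, though that is not checked here.

```r
# Hypothetical weighted edge list (invented names and weights)
toyEdges <- data.frame(n1 = c("a", "a", "b"),
                       n2 = c("b", "c", "c"),
                       weight = c(4, 1, 2),
                       stringsAsFactors = FALSE)

# Each edge touches both of its end nodes, so stack the two columns
endpoints <- c(toyEdges$n1, toyEdges$n2)
weights   <- c(toyEdges$weight, toyEdges$weight)

# Degree: number of edges at each node (weights ignored)
toyDegree <- tapply(rep(1, length(endpoints)), endpoints, sum)

# Total edge weight: sum of the weights of the edges at each node
toyWeightTotal <- tapply(weights, endpoints, sum)

print(toyDegree)
print(toyWeightTotal)
```

Every node here has the same degree, yet the weight totals differ, which is the kind of distinction the extra measure is meant to capture.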


### Question 1 - Solution

In [4]:
source("my_functions.R")

Before calculating centrality, there's an issue to address: there are some disconnected components. This will negatively affect the closeness measure in particular. Therefore let's take only the giant component going forward.

In [5]:
giantComponent = getGiantComponent(namedNet);
names = V(giantComponent)$name;
centralityData = data.frame(names=names);

degreeCentrality = centr_degree(giantComponent)$res;
centralityData = cbind(centralityData, degreeCentrality);

edgeWeightTotals = getEdgeWeightTotals(giantComponent, names);
centralityData = cbind(centralityData, edgeWeightTotals);

closenessCentrality = centr_clo(giantComponent)$res;
centralityData = cbind(centralityData, closenessCentrality);

eigenvectorCentrality = centr_eigen(giantComponent)$vector;
centralityData = cbind(centralityData, eigenvectorCentrality);

eigenvectorEdgeWeights = eigenvectorCentrality * edgeWeightTotals;
centralityData = cbind(centralityData, eigenvectorEdgeWeights);

This is a quick acid test. It should be true because each edge's weight is counted once for each of its two end nodes, so the node totals sum to twice the total edge weight.

In [6]:
sum(edgeWeightTotals)/2 == sum(E(giantComponent)$weight)

Below, we can see the full results for all measures of centrality.

In [16]:
topListThreshold = 15
namesByRank = data.frame("rank"=1:topListThreshold);

#degree
degreeOrdered = centralityData[order(centralityData$degreeCentrality,
                                     decreasing = TRUE ), ];
degreeTop15 = degreeOrdered[1:topListThreshold,1];
namesByRank = cbind(namesByRank, "degree" = degreeTop15);

#edge weights
edgeWeightsOrdered = centralityData[order(centralityData$edgeWeightTotals, 
                                          decreasing = TRUE ), ];
edgeWeightTop15 = edgeWeightsOrdered[1:topListThreshold,1];
namesByRank = cbind(namesByRank, "edge weight" = edgeWeightTop15);

#closeness
closenessOrdered = centralityData[order(centralityData$closenessCentrality, 
                                        decreasing = TRUE ), ];
closenessTop15 = closenessOrdered[1:topListThreshold,1];
namesByRank = cbind(namesByRank, "closeness" = closenessTop15);

#eigenvector
eigenvectorOrdered = centralityData[order(centralityData$eigenvectorCentrality, 
                                          decreasing = TRUE ), ];
eigenvectorTop15 = eigenvectorOrdered[1:topListThreshold,1];
namesByRank = cbind(namesByRank, "eigenvector" = eigenvectorTop15);

#eigenvector x edge weights
eigenvectorEdgeWeightsOrdered = centralityData[order(centralityData$eigenvectorEdgeWeights, 
                                                     decreasing = TRUE ), ];
eigenvectorEdgeWeightsTop15 = eigenvectorEdgeWeightsOrdered[1:topListThreshold,1];
namesByRank = cbind(namesByRank, "eigenvector x edge weights" = eigenvectorEdgeWeightsTop15);


In [18]:
namesByRank

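| rank | degree | edge weight | closeness | eigenvector | eigenvector x edge weights |
| :--- | :----- | :---------- | :-------- | :---------- | :------------------------- |
| `<int>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` |
| 1 | israel | israel | israel | israel | israel |
| 2 | judah | judah | judah | judah | judah |
| 3 | david | david | jerusalem | david | david |
| 4 | jerusalem | jerusalem | david | jerusalem | jerusalem |
| 5 | egypt | moses | egypt | egypt | moses |
| 6 | benjamin | saul | ephraim | benjamin | egypt |
| 7 | manasseh | egypt | manasseh | ephraim | saul |
| 8 | ephraim | manasseh | benjamin | manasseh | benjamin |
| 9 | saul | benjamin | joseph | moses | manasseh |
| 10 | philistines | aaron | moses | jordan | ephraim |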


### Question 1 - results

In every measure, the top-ranked name is Israel and the second is Judah, so it is clear that the most important place is Israel and the most important person is Judah.

The names that appear in all (five) lists are:

In [8]:
intersect(degreeTop15,
          intersect(edgeWeightTop15,
                    intersect(closenessTop15,
                              intersect(eigenvectorTop15, 
                                        eigenvectorEdgeWeightsTop15))))

Since considering edge weights, and combining eigenvector centrality with them, were my own additions, here is a reduced comparison restricted to the three measures the assignment asks for:

In [9]:
intersect(degreeTop15,
          intersect(closenessTop15, 
                    eigenvectorTop15))

### Question 2

There were originally three measures asked for in the assignment and two others that I introduced.

Of these, I think that three out of five are useful going forward.

| Centrality  | Effective | What makes it effective |
| :---------- | :-------- | :---------------------- | 
| Degree | No | The edge weights might indicate that the person who wrote the text was prone to repeating themselves, and repeating pairs of names, unnecessarily. However, it seems most likely that many different passages of the text would mention pairs of names, so to measure the degree of nodes without edge weights (duplicated edges) is naive. | 
| Total edge weight | Yes | The edge weights might indicate that each mention of a pair came from a separate passage in the text, perhaps written by a different person. This would indicate more than one story told about this pair of names. That edge weights are important seems the most plausible explanation, considering that the Bible is an amalgamation of texts by many authors. |
| Closeness | Yes | A node that is at the centre of the network is relatively important compared to nodes at the fringes. |
| Eigenvector | No | If a pair of names appears then it might be the case that a relatively minor event occurred. For example, a person who passes through an important place like Jerusalem might be a less important person. Are we to consider a person as being important because that person passed by an important place? This does not seem to make sense. |
| Eigenvector x total edge weight | Yes | This is an improvement on the former measurement. If a person passes through an important place, and is mentioned as doing so many times, then this seems to increase the probability that this person is important. |

From the above analysis, it seems that `degree centrality` and `eigenvector centrality` are the least-promising. 

There are two other measures of centrality we haven't looked at in this report: betweenness and physical centrality. 

Betweenness would be interesting to look at and I think it would give a similar insight to closeness.

Physical centrality would not make much sense for this text, since many nodes are people. It might be interesting for a smaller network of place names only. However, Israel is on the coast, so Jerusalem might not come out as the centre, given that the sea will have few locations to talk about.

To actually quantify which nodes are the most important, I suggest aggregating the top-15 lists to find the most common names. In effect, we could create an ensemble of all lists.
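
A rough sketch of such an ensemble is below. It simply counts how many lists each name appears in. To keep it short, it uses only the top five names of four of the measures from the ranking table in Question 1, and `topLists`, `votes` and `consensus` are names made up for this illustration.

```r
# Top-5 names per measure, truncated from the ranking table in Question 1
topLists <- list(
    degree      = c("israel", "judah", "david", "jerusalem", "egypt"),
    edgeWeight  = c("israel", "judah", "david", "jerusalem", "moses"),
    closeness   = c("israel", "judah", "jerusalem", "david", "egypt"),
    eigenvector = c("israel", "judah", "david", "jerusalem", "egypt")
)

# Count how many lists each name appears in, most frequent first
votes <- sort(table(unlist(topLists)), decreasing = TRUE)
print(votes)

# Names that appear in every list
consensus <- names(votes)[as.vector(votes) == length(topLists)]
print(consensus)
```

A simple count like this treats every measure as equally trustworthy; weighting the lists differently would be another design choice.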

### Question 3

Assortativity depends on the degrees at either end of each edge rather than on path lengths, so there is no need to keep working with the giant component. We will look at the entire network.

We should do two calculations for assortativity:
- edge weights are not considered
- edge weights are considered

### Question 3 - assortativity (not including edge weights)

First we implement the version that does not include edge weights because we can extend it later to account for edge weights.

Let $ e_{ij} $ be the matrix whose entry $ (i, j) $ counts the edges between nodes of degree $ i $ and nodes of degree $ j $.

We can construct $ e_{ij} $ by iterating over the network's edges and then incrementing the edge count for each pair. In a sense the weight of each edge is implicitly $ 1 $.
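
As a small illustration of that construction, here is a base-R sketch on an invented four-node path graph. It does not use igraph or the functions in `my_functions.R`; `toyEdges`, `toyDegree` and `kMaxToy` are hypothetical names used only here.

```r
# Hypothetical path graph a - b - c - d, stored as a two-column edge list
toyEdges <- matrix(c("a", "b",
                     "b", "c",
                     "c", "d"),
                   ncol = 2, byrow = TRUE)

# Degree of each node: how often it appears as an end node of an edge
degreeTable <- table(toyEdges)
toyDegree <- setNames(as.vector(degreeTable), names(degreeTable))
kMaxToy <- max(toyDegree)

# e[i, j] counts edges joining a degree-i node to a degree-j node
e <- matrix(0, nrow = kMaxToy, ncol = kMaxToy)
for (row in seq_len(nrow(toyEdges))) {
    i <- toyDegree[[toyEdges[row, 1]]]
    j <- toyDegree[[toyEdges[row, 2]]]
    # Undirected, so record the edge in both directions
    e[i, j] <- e[i, j] + 1
    e[j, i] <- e[j, i] + 1
}
print(e)
```

Note that an edge between two nodes of equal degree (here b and c) adds 2 to the diagonal entry, which keeps the matrix total at twice the edge count.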

In [10]:
source('my_functions.R')

degreeOfNodes = degree(namedNet);
namesInNet = V(namedNet)$name;

kMax = max(degreeOfNodes);
edgeCount = ecount(namedNet);

e_k_kPrime = constructMatrixUsingDegree(degreeOfNodes,
                                        namedNet,
                                        namesInNet,
                                        kMax,
                                        edgeCount)


As another quick test: since each edge is entered twice, once at $ (i, j) $ and once at $ (j, i) $, the matrix total should be twice the edge count:

In [11]:
sum(e_k_kPrime)/2 == edgeCount

We can use this to calculate assortativity:

$ k_{nn}(k) = \sum_{k'=1}^{k_{max}} k' \cdot P(k'|k) $

$ P(k'|k) = \frac{e_{kk'}}{\sum_{k'} e_{kk'}} $

And this means we can define $ k $ as an independent variable, then $ k_{nn}(k) $ will be a dependent variable. Plotting these two should reveal a relationship.

So the algorithm can run like this:

- for each $ k \in [1:k_{max}] $:
    - sum $ e_{kk'} $ over $ k' \in [1:k_{max}] $ (this is the denominator of $ P(k'|k) $; calculating it once per $ k $ improves performance)
    - sum over $ k' \in [1:k_{max}] $:
        - find $ k' \cdot P(k'|k) $

These functions are provided in `my_functions.R` (see the Appendix).
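
The same calculation can also be sketched without explicit loops, by normalising each row of the matrix and multiplying by the vector of degrees. This is an alternative illustration on an invented matrix (the one for a four-node path graph a - b - c - d), not the implementation in `my_functions.R`.

```r
# Hypothetical symmetric matrix for a four-node path graph:
# e[k, k'] counts edge ends joining degree-k nodes to degree-k' nodes
e <- matrix(c(0, 2,
              2, 2),
            nrow = 2, byrow = TRUE)
kMaxToy <- nrow(e)

# P(k'|k): divide each row by its row sum
P <- e / rowSums(e)

# k_nn(k) = sum over k' of k' * P(k'|k)
kNN <- as.vector(P %*% seq_len(kMaxToy))
print(kNN)
```

In the path graph the degree-1 nodes only neighbour degree-2 nodes, and each degree-2 node has one neighbour of each degree, which is what the two values reflect. A row whose sum is zero (no node has that degree) would give `NaN` here.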

Below, $ k_{nn}(k) $ is found and then used to create a linear model. Depending on the model's gradient, $ \beta_1 $:

$ \beta_1 > 0 $: assortative

$ \beta_1 = 0 $: neutral

$ \beta_1 < 0 $: disassortative
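
In practice a fitted slope will rarely be exactly $ 0 $, so some tolerance is needed before calling a network neutral. The sketch below shows one way this could be done on invented values; `classifySlope` and its `tolerance` of 0.01 are assumptions made for this example, not part of the assignment or of `my_functions.R`.

```r
# Hypothetical k_nn values that fall as k rises (invented for illustration)
kToy   <- 1:5
kNNToy <- c(4.0, 3.4, 3.1, 2.5, 2.0)

# Slope of the fitted line k_nn(k) ~ k
slope <- unname(coef(lm(kNNToy ~ kToy))[2])

# Label the slope; 'tolerance' is an assumed cut-off, not a standard value
classifySlope <- function(slope, tolerance = 0.01) {
    if (slope > tolerance) {
        "assortative"
    } else if (slope < -tolerance) {
        "disassortative"
    } else {
        "neutral"
    }
}

print(slope)
print(classifySlope(slope))
```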

In [12]:
k = 1:kMax;
beta_1 = lm(k_nn(e_k_kPrime, k, kMax) ~ k)$coefficients[[2]]
beta_1

So, when only node degree is considered and edge weights are ignored, the network is neutral or slightly disassortative.

### Question 3 - assortativity (including edge weights)

In [13]:
kMax = max(edgeWeightTotals);

edgeWeightTotals = getEdgeWeightTotals(namedNet, names);

e_k_kPrime = constructMatrixUsingEdgeWeights(edgeWeightTotals,
                                             namedNet,
                                             namesInNet,
                                             kMax,
                                             edgeCount)

This should now be equal to double the total edge weights in the network:

In [14]:
# sum(e_k_kPrime) == sum(edgeWeightTotals)
sum(e_k_kPrime)
sum(edgeWeightTotals)

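TODO: implement tests to confirm whether this function works as expected.

The two sums above do not match, which is a bug I'd like to fix. A likely cause is that `edgeWeightTotals` is computed over `names`, which still holds the giant component's node names from Question 1, while the matrix construction looks nodes up by their position in `namesInNet`; `kMax` is also taken before `edgeWeightTotals` is recomputed. Since only some values are affected, it shouldn't affect the result too much.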
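
One way to approach the test mentioned above is to check the construction on a graph small enough to verify by hand. The sketch below does this in base R with a hypothetical `buildWeightedMatrix`, which is a stand-in written for this example and not the `constructMatrixUsingEdgeWeights` function from `my_functions.R`. The triangle and its weights are invented, and the weights are assumed to be positive integers so that the totals can be used as matrix indices.

```r
# Hypothetical stand-in for constructMatrixUsingEdgeWeights, written in base R
buildWeightedMatrix <- function(edges) {
    # Total edge weight per node, as a vector named by node
    totalsArray <- tapply(c(edges$weight, edges$weight),
                          c(edges$n1, edges$n2), sum)
    totals <- setNames(as.vector(totalsArray), names(totalsArray))

    e <- matrix(0, nrow = max(totals), ncol = max(totals))
    for (row in seq_len(nrow(edges))) {
        i <- totals[[edges$n1[row]]]
        j <- totals[[edges$n2[row]]]
        w <- edges$weight[row]
        # Undirected, so add the edge weight in both directions
        e[i, j] <- e[i, j] + w
        e[j, i] <- e[j, i] + w
    }
    e
}

# Invented triangle: node totals are a = 4, b = 3, c = 5
toyEdges <- data.frame(n1 = c("a", "b", "a"),
                       n2 = c("b", "c", "c"),
                       weight = c(1, 2, 3),
                       stringsAsFactors = FALSE)
e <- buildWeightedMatrix(toyEdges)

# The matrix total should be double the total edge weight
stopifnot(sum(e) == 2 * sum(toyEdges$weight))
# Recording both directions should make the matrix symmetric
stopifnot(isSymmetric(e))
# With one node per total, row k should sum to k
stopifnot(all(rowSums(e)[3:5] == c(3, 4, 5)))
```

If checks like these pass on the toy graph but the sums still differ on the real network, the mismatch is more likely in how the inputs are prepared than in the matrix construction itself.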

In [15]:
k = 1:kMax;
beta_1 = lm(k_nn(e_k_kPrime, k, kMax) ~ k)$coefficients[[2]]
beta_1

### Question 3 - discussion

When we ignored edge weights, assortativity was close to $ 0 $ or slightly negative, so the network is neutral to slightly disassortative.

However, when we included edge weights, the network became assortative.

The second result (including edge weights) seems more promising since we have no reason to believe duplicate edges in the network are unimportant.

Finding a stronger score for assortativity once duplicated edges (edge weights) are considered is not surprising, because duplicating an edge increases the importance of the nodes at either end.

### Appendix

In [None]:
### My code 

getEdgeWeightTotals = function(network, names) {
    edgeWeightTotals = c();
    for (name in names) {
        # Being an undirected graph, we could take .from or .to
        edgesToThisNode = E(network)['.from'(name)]
        combinedWeight = sum(edgesToThisNode$weight)
        edgeWeightTotals = c(edgeWeightTotals, combinedWeight);
    }
    (edgeWeightTotals)
}

getGiantComponent = function(network) {
    components = igraph::clusters(network, mode="weak");
    componentId = which.max(components$csize);
    # Use the function argument rather than the global namedNet
    vertexIds = V(network)[components$membership == componentId];
    giantComponent = igraph::induced_subgraph(network, vertexIds);
    (giantComponent)
}

constructMatrixUsingDegree = function(degreeOfNodes, namedNet, namesInNet, kMax, edgeCount) {
    e_ij = matrix(0, nrow=kMax, ncol=kMax);
    for (edgeIndex in 1:edgeCount) {
        edge = E(namedNet)[edgeIndex];
        nodes = (ends(namedNet, edge[1]));

        name_i = nodes[[1]];
        nodeIndex_i = match(name_i, namesInNet);
        i = degreeOfNodes[nodeIndex_i];

        name_j = nodes[[2]];
        nodeIndex_j = match(name_j, namesInNet);
        j = degreeOfNodes[nodeIndex_j];

        # An undirected graph should have from,to interchangeable
        e_ij[i,j] = e_ij[i,j] + 1;
        e_ij[j,i] = e_ij[j,i] + 1;
    }
    (e_ij)
}

constructMatrixUsingEdgeWeights = function(edgeWeightsOfNodes, namedNet, namesInNet, kMax, edgeCount) {
    e_ij = matrix(0, nrow=kMax, ncol=kMax);
    for (edgeIndex in 1:edgeCount) {
        edge = E(namedNet)[edgeIndex];
        nodes = (ends(namedNet, edge[1]));

        name_i = nodes[[1]];
        nodeIndex_i = match(name_i, namesInNet);
        i = edgeWeightsOfNodes[nodeIndex_i];

        name_j = nodes[[2]];
        nodeIndex_j = match(name_j, namesInNet);
        j = edgeWeightsOfNodes[nodeIndex_j];

        # An undirected graph should have from,to interchangeable
        e_ij[i,j] = e_ij[i,j] + edge$weight;
        e_ij[j,i] = e_ij[j,i] + edge$weight;
    }
    (e_ij)
}


doSumOverKPrime = function(e_k_kPrime, k, kMax) {
    total = 0;
    for (kPrime in 1:kMax) {
        total = total + e_k_kPrime[k,kPrime];
    }
    (total)
}

probabilityKPrimeGivenK = function(e_k_kPrime, kPrime, k, sumOverKPrime) {
    (e_k_kPrime[k,kPrime]/sumOverKPrime)
}

k_nn = function(e_k_kPrime, k, kMax) {
    sumOverKPrime = doSumOverKPrime(e_k_kPrime, k, kMax);
    total = 0;
    for (kPrime in 1:kMax) {
        total = total + kPrime * probabilityKPrimeGivenK(e_k_kPrime, kPrime, k, sumOverKPrime)
    }
    (total)
}
