# Introduction

In this lab exercise, you will learn how to analyse networks in Python. We start with the practicalities of loading data and creating a network in Python. We then cover some basic analysis, and gradually move to more complex analyses.

We will be analysing an author network. We have prepared two types of networks: a collaboration network and a citation network.

<div class="alert alert-info">
This Python notebook is intended to be used as an exercise. We have prepared everything for you, but in some places we will ask you to fill in the blanks. Exercises where you are asked to do something, or to think about something, will be indicated like this. If you need to write and execute your own code, we provide empty space below to do so.
</div>

<div class="alert alert-warning">
If you need any help with anything, please don't hesitate to ask your teachers. 
</div>

## Reading data

We start by loading the network data. We prepared three `tab`-separated text files for your convenience. The file `authors.txt` contains all the author information, the other two contain the data on collaboration links (`collaboration.txt`) and the data on citation links (`citation.txt`).

In order to read in this data, we first need to make sure that Python is able to read it. A very versatile *package* for handling data in Python is called `pandas`. For those of you familiar with `R`, it is similar to the `data.frame` in `R`.

We *import* this package as follows.

<div class="alert alert-success">
In order to execute the code you have to press Ctrl-Enter while selecting the code cell below. Alternatively, you can press the "Play" button at the top of the screen. This also moves to the next cell at the same time. Using Shift-Enter instead of Ctrl-Enter will also execute the code and move to the next cell at the same time.
</div>

In [None]:
import pandas as pd

<div class="alert alert-success">
If you have executed that code cell correctly, it should now be numbered `1`. While the code in that cell is being executed it is marked by an asterisk `*`. Each cell of executed code will be numbered in the order in which you execute it. If you execute it again, it will be numbered `2`, et cetera.
</div>

We are now ready to read in data. We have named the `pandas` package `pd`, which will save us some typing. We are going to read the tab-separated files using the general interface for reading comma-separated value files, normally known as `csv` files.

In [None]:
authors_df = pd.read_csv('authors.txt', sep='\t')

We called the *function* `read_csv` of the `pandas` package (which we refer to as `pd`). We provide it with two *arguments*. The first argument is the name of the file we want to read. The second argument indicates that the values are not separated by a comma in this file, but by a `tab`, which is represented as `\t`. 

The `pandas` package took care of reading the file, and has now stored it in the variable called `authors_df`. You can take a closer look at `authors_df` to see the data that we just read.

In [None]:
authors_df

After execution you will see quite a long table with multiple columns. At the bottom it shows how many rows and how many columns are stored in the table. 

Various information about the authors is provided. Besides the id and the names, it shows the total number of publications available from the WoS, the gender and the discipline in which the authors most frequently published. 

We made a selection of all authors in Belgium who have first published before 2000 and have last published after 2010. We recorded all collaborations and all citations between these authors. Collaborations or citations with other authors outside of this dataset are excluded.

<div class="alert alert-info">
Load the collaboration file. You should be able to do that yourself by filling in the `...` below. The file you want to load is called `collaboration.txt`.
</div>

In [None]:
collaboration_df = pd.read_csv('...', sep='\t')

<div class="alert alert-info">
Take a closer look at what is stored in `collaboration_df`. How many links are there in the collaboration network?
</div>

## Creating a network

We now have everything we need to construct our collaboration network. In order to work with networks, we again have to import a package. This package is called `igraph`.

<div class="alert alert-info">
Import `igraph` yourself, and call the package `ig`.
</div>

We now provide the data stored in `authors_df` and `collaboration_df` to the package `igraph` in order to construct a network. The terminology of the package is somewhat different, and a network is called a *graph*, nodes are called *vertices* and links are called *edges*. They all refer to the same respective concepts, only the terminology is different.

In [None]:
G_collaboration = ig.Graph.DictList(
      vertices=authors_df.to_dict('records'),
      edges=collaboration_df.to_dict('records'),
      vertex_name_attr='author_id',
      edge_foreign_keys=('author_idA', 'author_idB'),
      directed=False
      )

Calling the function `.to_dict('records')` changes the data format to something that `igraph` can work with. The `vertex_name_attr` denotes the column of `authors_df` that serves as the identifier of each node. The `edge_foreign_keys` indicates which columns of `collaboration_df` serve to identify the two endpoints of a collaboration link. Finally, by stating `directed=False` we indicate that we create an undirected graph.
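To make the record format concrete, here is a small plain-Python sketch. It uses only the standard library and made-up rows, and it does not call `pandas` or `igraph`. It shows the list-of-dictionaries shape that `.to_dict('records')` produces, and how the two foreign-key columns can be resolved to node numbers. The column names follow the ones used above; all values are hypothetical.

```python
import csv
import io

# Hypothetical tab-separated data, mimicking the layout of the real files.
toy_authors_txt = "author_id\tfull_name\n10\tdoe, j\n20\troe, r\n30\tpoe, p\n"
toy_collab_txt = "author_idA\tauthor_idB\tn_joint_papers\n10\t20\t3\n20\t30\t1\n"

# One dictionary per row: the same shape as DataFrame.to_dict('records').
# (Unlike pandas, csv.DictReader keeps every value as a string.)
toy_vertices = list(csv.DictReader(io.StringIO(toy_authors_txt), delimiter="\t"))
toy_edges = list(csv.DictReader(io.StringIO(toy_collab_txt), delimiter="\t"))

# Map each identifier to its position in the vertex list (the node number).
index_of = {v["author_id"]: i for i, v in enumerate(toy_vertices)}

# Resolve the two foreign-key columns of every edge to node numbers.
toy_edge_list = [(index_of[e["author_idA"]], index_of[e["author_idB"]])
                 for e in toy_edges]

print(toy_vertices[0])   # {'author_id': '10', 'full_name': 'doe, j'}
print(toy_edge_list)     # [(0, 1), (1, 2)]
```

Note that the node numbers (0, 1, 2) are positions in the vertex list, not the original `author_id` values.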

We now have created our collaboration network! Before we go into the analysis of the network, let us first look at a brief summary of the network.

In [None]:
print(G_collaboration.summary())

The first line indicates that we have an undirected graph (`U`) with 7384 nodes and 90036 links. 

<div class="alert alert-info">
Does that match up with the number of rows from `authors_df` and `collaboration_df`?
</div>

The next line shows the columns of `authors_df` that are now available as vertex attributes (indicated by the `v` behind the name of the attribute), and the columns of `collaboration_df` that are now available as edge attributes (indicated by the `e`).

## VOSViewer

You may construct scientometric networks using VOSViewer, which you can then import for further analysis in `igraph`. In order to import the file in `igraph` you need to have saved both the `map` file and the `network` file in VOSViewer. See the manual of VOSViewer for more explanation.

In [None]:
map_df = pd.read_csv('vosviewer_map.txt', sep='\t')
network_df = pd.read_csv('vosviewer_network.txt', sep='\t', header=None)
# The network file from VOSViewer has no header, so we set it manually
network_df.columns = ['idA', 'idB', 'weight']

Now we have loaded the data, so we can simply construct a network as before.

In [None]:
G_vosviewer = ig.Graph.DictList(
      vertices=map_df.to_dict('records'),
      edges=network_df.to_dict('records'),
      vertex_name_attr='id',
      edge_foreign_keys=('idA', 'idB'),
      directed=False
      )

The layout and clustering are also stored by VOSViewer, and we can use them to display the same visualization in `igraph`.

In [None]:
layout = ig.Layout(coords=zip(*[G_vosviewer.vs['x'], G_vosviewer.vs['y']]))
clustering = ig.VertexClustering.FromAttribute(G_vosviewer, 'cluster')

ig.plot(clustering, layout=layout, vertex_size=5, vertex_label=None)

# Network analysis

Now that we have loaded everything, we can take a closer look at the network and try to analyse it.

## Connectivity

Let us start with a very simple question. Is the network connected?

In [None]:
G_collaboration.is_connected()

Apparently, not all authors in Belgium are connected via co-authored papers.

<div class="alert alert-info">
How many authors do you think will be connected to each other? 500? 5000? Almost everybody?
</div>

In order to take a closer look, we need to detect the *connected components*. This is easily done, but the function is confusingly called `clusters`.

In [None]:
components = G_collaboration.clusters()

We only want the so-called giant component. 

<div class="alert alert-success">
Now it is time to introduce you to a little trick: you can get a list of all functions of the variable `components` by simply typing `components.` and then pressing `Tab` (make sure the cursor is located after the `.`). If you then start typing the name of the function you are looking for and press `Tab` again, Python will automatically finish it as much as possible. This works in general: whenever you press `Tab`, Python will try to *autocomplete* whatever you are typing.
</div>

<div class="alert alert-info">
What function do you think returns the giant component?
</div>

<div class="alert alert-success">
Time for one other trick: go back to the previous code cell, and select the function you think returns the giant component. Put the cursor anywhere on the function and press `Shift-Tab`. You now get documentation of what this function does. You can press the `+` to find out more.
</div>

Let us only look at the giant component.

In [None]:
H = components.giant()

Let us check how many nodes are in the giant component. Remember, we can again call the function `summary`.

In [None]:
print(H.summary())

The giant component includes 94% of the nodes and more than 99.9% of the links of the complete graph.
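To see what a connected component is, the following plain-Python sketch finds the components of a tiny, hypothetical network with a breadth-first search. It is only an illustration of the concept and makes no claim about how `igraph` implements it.

```python
from collections import deque

# Hypothetical toy network: node -> set of neighbours (undirected).
toy_adjacency = {
    0: {1, 2}, 1: {0, 2}, 2: {0, 1},   # a triangle
    3: {4}, 4: {3},                    # a separate pair
    5: set(),                          # an isolated node
}

def connected_components(adjacency):
    """Return a list of components, each a sorted list of nodes."""
    seen = set()
    found = []
    for start in adjacency:
        if start in seen:
            continue
        # Breadth-first search collects everything reachable from `start`.
        seen.add(start)
        queue = deque([start])
        component = []
        while queue:
            node = queue.popleft()
            component.append(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        found.append(sorted(component))
    return found

toy_components = connected_components(toy_adjacency)
toy_giant = max(toy_components, key=len)   # the largest component

print(toy_components)   # [[0, 1, 2], [3, 4], [5]]
print(toy_giant)        # [0, 1, 2]
```

Every node belongs to exactly one component, and the giant component is simply the largest one.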

<div class="alert alert-info">
Double check whether the giant component is connected.
</div>

Let us take a closer look at how far authors in Belgium are apart from one another. Let us simply take a look at node number 0 (the first node has number 0, not 1) and node number 1. 

In [None]:
paths = G_collaboration.get_shortest_paths(0, 1)
paths

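This returns a list with one shortest path for each target node; to obtain all shortest paths between two nodes you would use `get_all_shortest_paths` instead. We asked for a single target (node number 1), so the list contains only one path. Let us select that.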

In [None]:
path = paths[0]
path

<div class="alert alert-info">
How many nodes are in the path? What is the path length?
</div>

Perhaps these numbers do not mean that much to you. You can find out more about an individual node by looking at the `VertexSequence` of `igraph`, abbreviated as `vs`. This is a sort of list of all vertices, and is indexed by brackets `[ ]` instead of parentheses `( )` as we do with functions.

In [None]:
G_collaboration.vs[0]

The vertex itself can be indexed by attribute name, so you can return only the name as follows.

In [None]:
G_collaboration.vs[0]['full_name']

You can also list multiple vertices at once.

In [None]:
G_collaboration.vs[[0, 2384, 6486, 1]]['full_name']

This shows that Van Acker collaborated with Janssen, who collaborated with Carlon, who in the end collaborated with Leyns.

You can also get the vertex by searching for the full name. For example, if we want to find `'janssen, f'` we can use the following.

In [None]:
G_collaboration.vs.find(full_name_eq = 'janssen, f')

Here `full_name_eq` refers to the condition that the vertex attribute `full_name` should **eq**ual `'janssen, f'`.

<div class="alert alert-info">
What is the first name of `'carlon, e'`?
</div>

<div class="alert alert-info">
Find the shortest path from `'janssen, f'` to `'leyns, l'`. Who is in between?
</div>

We can let `igraph` also calculate how far apart all nodes are.

<div class="alert alert-warning">
The following may take some time to run
</div>

In [None]:
path_lengths = G_collaboration.path_length_hist()
print(path_lengths)

<div class="alert alert-info">
How far apart are most authors? Do you think most authors are close by? Or do you think they are far apart?
</div>

Let us take a closer look at the path between node 0 and node 1 again. Instead of the nodes on the path, we now want to take a closer look at the edges on the path.

In [None]:
epath = G_collaboration.get_shortest_paths(0, 1, output='epath')
epath

There are three edges on this path, but the numbers themselves are not very informative. They refer to the edges, and similar to the `VertexSequence` we encountered earlier, there is also an `EdgeSequence`, abbreviated as `es`. Let us take a closer look at the number of joint papers that the authors have co-authored.

In [None]:
G_collaboration.es[epath[0]]['n_joint_papers']

Perhaps there are other paths that connect the two authors with more joint papers? Perhaps we could use the number of joint papers as weights?

In [None]:
epath = G_collaboration.get_shortest_paths(0, 1, weights='n_joint_papers', output='epath')
epath

We do get a path that is longer. Let us take a look at the number of joint papers.

In [None]:
G_collaboration.es[epath[0]]['n_joint_papers']

The number of joint papers is lower! That is because 'shortest path' means: the path with the lowest sum of the weights. This is clearly not what we want. You should always be aware of this whenever using the concept of the 'shortest path'.

<div class="alert alert-danger">
<b>Attention!</b> Weighted shortest paths have the *lowest* total weight.
</div>
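The effect can be reproduced on a toy example. The plain-Python sketch below uses hypothetical data and a small Dijkstra implementation rather than `igraph`. It first uses the number of joint papers directly as weights, and then tries one possible workaround: converting each weight to a distance, here `1 / n_joint_papers`, so that strong collaborations count as short. That conversion is an assumption made for illustration; it is not used elsewhere in this notebook.

```python
import heapq

# Hypothetical toy network: (node, node) -> number of joint papers.
toy_joint_papers = {
    ("a", "b"): 10, ("b", "d"): 10,   # strong two-step route a-b-d
    ("a", "c"): 1, ("c", "d"): 1,     # weak two-step route a-c-d
}

def build_adjacency(weights):
    adjacency = {}
    for (u, v), w in weights.items():
        adjacency.setdefault(u, {})[v] = w
        adjacency.setdefault(v, {})[u] = w
    return adjacency

def cheapest_path(weights, source, target):
    """Dijkstra: return (total weight, path) with the lowest total weight."""
    adjacency = build_adjacency(weights)
    queue = [(0, source, [source])]
    done = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == target:
            return cost, path
        if node in done:
            continue
        done.add(node)
        for neighbour, w in adjacency[node].items():
            if neighbour not in done:
                heapq.heappush(queue, (cost + w, neighbour, path + [neighbour]))
    return None

# Using the number of joint papers directly favours the WEAK route.
print(cheapest_path(toy_joint_papers, "a", "d"))   # (2, ['a', 'c', 'd'])

# Assumed workaround: use 1 / n_joint_papers as a distance instead.
toy_distance = {edge: 1 / n for edge, n in toy_joint_papers.items()}
print(cheapest_path(toy_distance, "a", "d")[1])    # ['a', 'b', 'd']
```

Whether the reciprocal is the right conversion depends on what you want a "short" path to mean; other transformations are possible.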

## Clustering

We already saw that most authors tend to be connected through no more than 3 or 4 people. As we encountered in the lecture, this is one of the signs of a small-world network. The other one is clustering: whether co-authors of an author also tend to be co-authors among themselves.

Let us take a look at the co-authors of author number 0, which are called the *neighbors* in network terminology.

In [None]:
G_collaboration.neighborhood(0)

What we actually want to know is whether many of those neighbors are connected. That is, we want to take the subgraph of all authors that have co-authored with author number 0.

In [None]:
H = G_collaboration.induced_subgraph(G_collaboration.neighborhood(0))
print(H.summary())

This subgraph only has 13 nodes (including node 0, so it has 12 neighbours) and 35 edges. This is sufficiently small to be easily plotted for visual inspection.

In [None]:
H.vs['color'] = 'red'
H.vs[0]['color'] = 'grey'
ig.plot(H)

<div class="alert alert-info">
Do many of the co-authors collaborate among themselves as well? Why do you think this happens?
</div>

We can also ask `igraph` to calculate the clustering coefficient of node 0. In `igraph` this is called *transitivity*, which is the same concept under a different name.

In [None]:
G_collaboration.transitivity_local_undirected(0)

<div class="alert alert-info">
What percentage of the co-authors of node 0 have also written papers with each other?
</div>

Let us now calculate this for all nodes.

In [None]:
G_collaboration.transitivity_avglocal_undirected()

<div class="alert alert-info">
What percentage of the co-authors have also written papers with each other on average? Can we conclude that the graph looks like a small-world network?
</div>
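The clustering coefficient of a node can also be computed by hand. The plain-Python sketch below does so for a hypothetical toy network: it counts which fraction of all pairs of neighbours of a node are themselves linked. For the average, nodes with fewer than two neighbours are simply left out here, which is one of several possible conventions and may differ from the setting you use in `igraph`.

```python
import math
from itertools import combinations

# Hypothetical toy network: node -> set of neighbours (undirected).
toy_adjacency = {
    0: {1, 2, 3},
    1: {0, 2},
    2: {0, 1},
    3: {0},
}

def local_clustering(adjacency, node):
    """Fraction of pairs of neighbours of `node` that are themselves linked."""
    pairs = list(combinations(sorted(adjacency[node]), 2))
    if not pairs:
        return float("nan")   # undefined for fewer than two neighbours
    linked = sum(1 for u, v in pairs if v in adjacency[u])
    return linked / len(pairs)

# Node 0 has three neighbours, so three possible pairs; only (1, 2) is linked.
print(local_clustering(toy_adjacency, 0))   # one third

# Average over the nodes for which the coefficient is defined.
toy_values = [local_clustering(toy_adjacency, n) for n in toy_adjacency]
toy_defined = [c for c in toy_values if not math.isnan(c)]
toy_average = sum(toy_defined) / len(toy_defined)
print(toy_average)   # about 0.78
```

A value of 1 means that all neighbours of a node are linked to each other, and a value of 0 means that none of them are.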

## Centrality

Often, people want to identify which nodes seem to be most important in some way in the network. This is often thought of as a type of *centrality* of a node.

### Degree

The simplest type of centrality is the *degree* of a node, which is simply the number of its neighbors. Previously, we saw that node 0 had 12 neighbors, so we say its degree is 12.

In [None]:
G_collaboration.degree(0)

We can also simply calculate the degree for everybody and store it in a new vertex attribute called `degree`.

In [None]:
G_collaboration.vs['degree'] = G_collaboration.degree()

<div class="alert alert-info">
What is the degree of `carlon, e`?
</div>

We can also take a look at the complete degree distribution. To plot it, we load the `matplotlib` package. We import the plotting functionality and name the package `plt`. We also include a statement telling Python to show the plots immediately in this Notebook.

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

Now let us plot a histogram of the degree, using 50 bins.

In [None]:
plt.hist(G_collaboration.vs['degree'], 50);

This clearly shows that the degree distribution is quite skewed. Most authors have only a few collaborators, while a few authors have many collaborators. If the degree distribution is so skewed, the network is sometimes referred to as a "scale-free" network.

The code below sorts the nodes in descending order of the degree.

In [None]:
highest_degree = sorted(G_collaboration.vs, key=lambda v: v['degree'], reverse=True)

The `sorted` function takes a list as input, `G_collaboration.vs` in our case, and sorts it according to a sort key. We indicate the sort key by a small function, called a `lambda` function, that returns the degree. In other words, the `sorted` function will sort the nodes according to the degree. By indicating `reverse=True` we obtain a list that is sorted highest to lowest, instead of the other way around.
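If `lambda` functions are new to you, the following stand-alone sketch shows the same sorting pattern on a few hypothetical records instead of `igraph` vertices.

```python
# Hypothetical records standing in for vertices with a 'degree' attribute.
toy_nodes = [
    {"full_name": "doe, j", "degree": 4},
    {"full_name": "roe, r", "degree": 12},
    {"full_name": "poe, p", "degree": 7},
]

# The lambda receives one element and returns the value to sort on.
toy_sorted = sorted(toy_nodes, key=lambda v: v["degree"], reverse=True)

# sorted() returns a new list; the original order is left untouched.
print([v["full_name"] for v in toy_sorted])   # ['roe, r', 'poe, p', 'doe, j']
print([v["full_name"] for v in toy_nodes])    # ['doe, j', 'roe, r', 'poe, p']
```

The list `highest_degree` above was built with exactly this pattern, only with the vertices of `G_collaboration` as input.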

You can look at the first five results in the following way.

In [None]:
highest_degree[:5]

So, apparently, P. Carmeliet has collaborated with 235 other authors from Belgium! This of course only considers the number of co-authors, it does not take into account the number of papers written with somebody else.
When specifying such *edge weights*, like the number of joint papers, the weighted degree is referred to as the *strength* of a node (a term that can be a bit confusing).

Let us look at the strength of node 0.

In [None]:
G_collaboration.strength(0, weights='n_joint_papers')

Apparently, author 0 collaborated with 12 different authors, and has a total strength of 30. But what does this 30 mean? We need to think carefully about this. Suppose that author 0 has co-authored a single publication with three other co-authors, then each of the three links to those co-authors would have an edge weight of `n_joint_papers = 1`. So, the strength would be 3. Hence, the strength denotes the total number of collaborations that an author had, which depends both on the number of publications and the number of collaborators per paper.
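The difference between degree and strength can be checked on a hypothetical publication list. The plain-Python sketch below builds the edge weights from the author lists of two made-up papers and then computes both quantities for one author; the helper names are ours and not part of `igraph`.

```python
from collections import Counter
from itertools import combinations

# Hypothetical publication list: each paper is given by its author list.
toy_publications = [
    ["a", "b", "c", "d"],   # one paper with four authors
    ["a", "b"],             # one paper with two authors
]

# Edge weight = number of joint papers of each pair of authors.
toy_weights = Counter()
for authors in toy_publications:
    for pair in combinations(sorted(authors), 2):
        toy_weights[pair] += 1

def toy_degree(author):
    """Number of distinct co-authors."""
    return sum(1 for pair in toy_weights if author in pair)

def toy_strength(author):
    """Sum of the edge weights over all co-authors."""
    return sum(w for pair, w in toy_weights.items() if author in pair)

print(toy_degree("a"), toy_strength("a"))   # 3 4
print(toy_degree("c"), toy_strength("c"))   # 3 3
```

Author `a` wrote only two papers but has a strength of 4, because the four-author paper alone contributes three collaborations.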

Sometimes, we wish to take into account the number of co-authorships when creating a link weight. We can then fractionally count the weight of each collaboration between $n_a$ authors as

$$\frac{1}{n_a - 1}.$$

This is also available as the edge attribute `n_joint_papers_frac`.

In [None]:
G_collaboration.strength(0, weights='n_joint_papers_frac')

<div class="alert alert-info">
What is the sum of $\frac{1}{n_a - 1}$ over all co-authors? Then shouldn't the strength sum up to a whole number? Why isn't that the case here?
</div>

### Betweenness centrality

Betweenness centrality is much more elaborate, and gives an indication of the number of times a node is on the shortest path from one node to another node.
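To give an idea of what is being computed, here is a plain-Python sketch of betweenness for a small, hypothetical, unweighted and undirected network. It follows the well-known approach of Brandes: run a breadth-first search from every node, count shortest paths, and accumulate each node's share of them. It is meant as an illustration only and is not necessarily how `igraph` implements it.

```python
from collections import deque

# Hypothetical toy network: two triangles joined by the bridge 2-3.
toy_adjacency = {
    0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
    3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4},
}

def betweenness(adjacency):
    centrality = {v: 0.0 for v in adjacency}
    for source in adjacency:
        # Breadth-first search from `source`, counting shortest paths.
        order = []
        predecessors = {v: [] for v in adjacency}
        n_paths = {v: 0 for v in adjacency}
        n_paths[source] = 1
        distance = {source: 0}
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adjacency[v]:
                if w not in distance:
                    distance[w] = distance[v] + 1
                    queue.append(w)
                if distance[w] == distance[v] + 1:
                    n_paths[w] += n_paths[v]
                    predecessors[w].append(v)
        # Walk back from the farthest nodes, passing on path shares.
        share = {v: 0.0 for v in adjacency}
        for w in reversed(order):
            for v in predecessors[w]:
                share[v] += n_paths[v] / n_paths[w] * (1 + share[w])
            if w != source:
                centrality[w] += share[w]
    # In an undirected network every pair was counted from both ends.
    return {v: c / 2 for v, c in centrality.items()}

toy_betweenness = betweenness(toy_adjacency)
print(toy_betweenness[2], toy_betweenness[3], toy_betweenness[0])   # 6.0 6.0 0.0
```

The two bridge nodes lie on every shortest path between the two triangles, while the other nodes lie on none. The sketch also shows why the calculation is expensive: it needs a full search from every single node.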

As you can imagine, this can take quite some time to calculate for all nodes. We only calculate it for node 0.

<div class="alert alert-warning">
The following may take some time to run
</div>

In [None]:
G_collaboration.betweenness(vertices=0)

As we did previously when dealing with shortest paths, we can also use a weight for determining the shortest paths. As we already found out, doing so actually does not correspond with our intuition that shorter paths are more strongly connected. The betweenness centrality defined in this way is therefore not very informative.

<div class="alert alert-danger">
<b>Attention!</b> Weighted shortest paths have the *lowest* total weight.
</div>

<div class="alert alert-warning">
Because betweenness centrality takes such a long time to run, we will not look at the most central authors according to betweenness centrality.
</div>

### Pagerank

One way of identifying central nodes relies on the idea of a random walk in a network. When performing such a random walk, we simply go from one author to the next, following the co-authorship links. The author that is most frequently visited during such a random walk is then seen as most central. This is actually the idea that underlies Google's famous search engine. Luckily, we can compute that a lot faster than betweenness.

In [None]:
G_collaboration.pagerank(0)

We can again calculate pagerank for all authors.

In [None]:
G_collaboration.vs['pagerank'] = G_collaboration.pagerank()

<div class="alert alert-info">
Get the top 5 most central authors according to Pagerank. Who is the most central? Are the results very different from the degree?
</div>
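The random-walk idea can be sketched in a few lines of plain Python. The example below uses a hypothetical star-shaped network and repeatedly redistributes the visiting probability along the links. It also includes a damping factor of 0.85, a standard ingredient of PageRank that the description above leaves out: with probability 0.15 the walker jumps to a random node instead of following a link. The sketch assumes that every node has at least one neighbour.

```python
# Hypothetical toy network: node 0 is linked to three otherwise isolated nodes.
toy_adjacency = {0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0}}

def pagerank(adjacency, damping=0.85, n_iterations=200):
    """Power iteration; assumes every node has at least one neighbour."""
    n = len(adjacency)
    rank = {v: 1 / n for v in adjacency}
    for _ in range(n_iterations):
        # Random jump: every node receives the same small share.
        new_rank = {v: (1 - damping) / n for v in adjacency}
        # Random walk: each node passes its rank evenly to its neighbours.
        for v, neighbours in adjacency.items():
            for w in neighbours:
                new_rank[w] += damping * rank[v] / len(neighbours)
        rank = new_rank
    return rank

toy_rank = pagerank(toy_adjacency)
print(toy_rank)
print(max(toy_rank, key=toy_rank.get))   # 0
```

The values sum to one, and the hub of the star is visited most often.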

We can again take into account the weights. In pagerank this means that a co-author that is a more frequent co-author will be more likely to be visited during a random walk. This is actually much more in line with our intuition than the shortest path. Let us see what we get if we do that.

In [None]:
G_collaboration.vs['pagerank_weighted'] = G_collaboration.pagerank(weights='n_joint_papers')

<div class="alert alert-info">
Are the results different for the weighted version of pagerank? Take a closer look at the discipline of the authors with the highest Pagerank. What do you think of this result? Can you explain why these authors come from those disciplines?
</div>

## Homophily

One of the analyses that people are frequently interested in is whether people that share some attribute are more likely to be connected. We here study the gender divide in science. In our collaboration network, the question is whether men are more likely to work with men and women are more likely to work with women.

In social network analysis, this question can be answered by looking at the *assortativity*. This is a measure that indicates whether people of the same type are more likely to connect (positive assortativity) or less likely to connect (negative assortativity). The assortativity varies between -1 and 1 and can be seen as a type of correlation.

Let us see what we obtain for our collaboration network here.

In [None]:
G_collaboration.assortativity_nominal(ig.VertexClustering.FromAttribute(G_collaboration, 'gender').membership)

Based on this measure, it seems that men and women collaborate quite evenly.
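To get a feeling for the scale of this measure, the plain-Python sketch below computes the assortativity for three tiny, hypothetical networks, following the usual definition by Newman: the share of links within the same category, compared to the share you would expect if links ignored the categories. It is an illustration only and it assumes that there are at least two categories with links.

```python
from collections import Counter

# Hypothetical toy data: two men and two women.
toy_gender = {"m1": "m", "m2": "m", "f1": "f", "f2": "f"}

def nominal_assortativity(edges, category):
    # Count link ends per pair of categories, in both directions.
    ends = Counter()
    for u, v in edges:
        ends[(category[u], category[v])] += 1
        ends[(category[v], category[u])] += 1
    total = sum(ends.values())
    categories = set(category.values())
    # Observed share of link ends that stay within the same category.
    same = sum(ends[(c, c)] for c in categories) / total
    # Share expected if links were placed without regard to category.
    share = {c: sum(ends[(c, d)] for d in categories) / total for c in categories}
    expected = sum(s * s for s in share.values())
    return (same - expected) / (1 - expected)

toy_mixed = [("m1", "m2"), ("f1", "f2"), ("m1", "f1"), ("m2", "f2")]
toy_separate = [("m1", "m2"), ("f1", "f2")]
toy_opposite = [("m1", "f1"), ("m2", "f2")]

print(nominal_assortativity(toy_mixed, toy_gender))      # 0.0
print(nominal_assortativity(toy_separate, toy_gender))   # 1.0
print(nominal_assortativity(toy_opposite, toy_gender))   # -1.0
```

A value close to 0, as in the first toy network, means that links within and between categories occur about as often as expected by chance.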

<div class="alert alert-info">
Are people from the same discipline more likely to collaborate according to the assortativity measure?
</div>

Let us see whether stronger collaborations are even more likely to occur within the same discipline.

In [None]:
H = G_collaboration.subgraph_edges(G_collaboration.es.select(n_joint_papers_gt = 10))

Here, we use the `select` function of the `EdgeSequence` to select edges. The argument `n_joint_papers_gt = 10` refers to the condition that `n_joint_papers` is **g**reater **t**han `10`. We then use the function `subgraph_edges` to create a new graph with only those selected edges.

<div class="alert alert-info">
Now look at the disciplinary assortativity of the stronger collaboration network. Is it stronger or not?
</div>

<div class="alert alert-info">
Look at gender assortativity of the stronger collaboration network. How does it compare to the overall gender assortativity?
</div>

Perhaps the absence of any gender divide in collaboration is a consequence of disciplinary differences.

First, let us create different subgraphs for each discipline. We do that by creating a `VertexClustering` of `G_collaboration` according to the `discipline` attribute.

In [None]:
disciplines = ig.VertexClustering.FromAttribute(G_collaboration, 'discipline')

Now, we can calculate the gender assortativity for each discipline separately. We will put it in a nice table using the `pandas` package.

In [None]:
gender_assortativity_df = pd.DataFrame([
    {'discipline': H.vs['discipline'][0], 
     'n_nodes': H.vcount(),
     'n_edges': H.ecount(),
     'gender_assortativity': H.assortativity_nominal(ig.VertexClustering.FromAttribute(H, 'gender').membership)
    }
 for H in disciplines.subgraphs()])
gender_assortativity_df

<div class="alert alert-info">
What do you conclude about gender homophily on the basis of these statistics?
</div>

Besides assortativity based on gender or discipline, we can also look at assortativity based on degree. Do authors that collaborate a lot tend to collaborate with other authors that collaborate a lot? Or is it the other way around, and do authors that collaborate little tend to collaborate with authors that collaborate a lot?

This can be answered by the degree assortativity.

In [None]:
G_collaboration.assortativity_degree()

### Citation network

Let us load the citation data to study the question of gender homophily in citations.

In [None]:
citation_df = pd.read_csv('citation.txt', sep='\t')

G_citation = ig.Graph.DictList(
      vertices=authors_df.to_dict('records'),
      edges=citation_df.to_dict('records'),
      vertex_name_attr='author_id',
      edge_foreign_keys=('author_id_from', 'author_id_to'),
      directed=True
      )

This is a directed network. We can check whether authors tend to reciprocate citations: if author A cites author B, how likely is author B to cite author A as well?

In [None]:
G_citation.reciprocity()

You can also look at assortativity according to some numerical value, using the `assortativity` function.

<div class="alert alert-info">
Try to see if collaboration is assortative with respect to the total number of incoming citations.
</div>

<div class="alert alert-info">
Look at gender assortativity of the citation network. What do you conclude?
</div>

<div class="alert alert-info">
Create a graph of stronger citations (having `n_cits` greater than `15`). Look at gender assortativity of the citation network. What do you conclude?
</div>

## Weak links

The weak link theory we discussed in class had two elements:

1. Weak links tend to connect people from different groups.
2. New information enters groups through weak links.

Let us start by analysing whether weak collaborations tend to connect people from different disciplines.

<div class="alert alert-info">
You already studied this question to some extent through the disciplinary assortativity. What do you expect regarding the strength of collaborations between authors from the same discipline versus authors from different disciplines?
</div>

Let us look at whether the average collaboration is stronger within disciplines than between disciplines. We again create a `VertexClustering` on the basis of the `discipline` vertex attribute.

In [None]:
disciplines = ig.VertexClustering.FromAttribute(G_collaboration, 'discipline')

We now want to calculate the mean number of joint papers across disciplinary boundaries, and within disciplinary boundaries. To calculate the mean, the `numpy` package comes in handy, which is a general purpose numerical calculation package. Let us import that first, and call it `np`.

In [None]:
import numpy as np

Now we can calculate the mean `n_joint_papers` that are crossing the disciplinary boundaries.

In [None]:
edge_crossing = disciplines.crossing()
np.mean([e['n_joint_papers'] for e in G_collaboration.es if edge_crossing[e.index]])

Let us compare that to the mean that does not cross disciplinary boundaries.

In [None]:
np.mean([e['n_joint_papers'] for e in G_collaboration.es if not edge_crossing[e.index]])

Clearly, collaborations that cross disciplinary boundaries tend to be weaker: the average number of joint papers within disciplinary boundaries is more than twice as high as across disciplinary boundaries.

<div class="alert alert-info">
How should you interpret this result? What would the result look like if there is only a single collaboration of `n_joint_papers = 10` across disciplines?
</div>

Will something similar also hold at a much finer scale?

Let us look at whether a link that connects authors who share many co-authors tends to be stronger. In other words, if co-authors share many other co-authors, do they tend to collaborate more frequently?

In [None]:
G_collaboration.es['similarity_jaccard'] = G_collaboration.similarity_jaccard(pairs=[e.tuple for e in G_collaboration.es])

Using `numpy` we can easily calculate the correlation coefficient between the Jaccard similarity and the number of joint papers.

In [None]:
np.corrcoef(np.log(G_collaboration.es['similarity_jaccard']), 
            np.log(G_collaboration.es['n_joint_papers']))

Indeed, also at such a detailed scale, it seems that people that share many collaborators tend to collaborate more frequently.
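For reference, the Jaccard similarity of two authors is the number of co-authors they share, divided by the number of authors that are a co-author of at least one of them. The plain-Python sketch below computes it for a hypothetical toy network; details such as whether a node counts as its own neighbour are conventions, and `igraph` has an option for that, so its values may differ slightly from this sketch.

```python
# Hypothetical toy network: node -> set of neighbours (undirected).
toy_adjacency = {
    0: {1, 2, 3}, 1: {0, 2, 3}, 2: {0, 1}, 3: {0, 1},
    4: {5}, 5: {4},
}

def jaccard(adjacency, u, v):
    """Shared neighbours divided by all neighbours of either node."""
    union = adjacency[u] | adjacency[v]
    if not union:
        return 0.0
    return len(adjacency[u] & adjacency[v]) / len(union)

# Nodes 0 and 1 share the neighbours 2 and 3, out of {0, 1, 2, 3}.
print(jaccard(toy_adjacency, 0, 1))   # 0.5

# Nodes 4 and 5 only collaborate with each other: nothing is shared.
print(jaccard(toy_adjacency, 4, 5))   # 0.0
```

In this sketch a pair without any shared co-authors gets a similarity of 0. The logarithm of 0 is not finite, which is worth keeping in mind when taking logarithms of such values, as in the correlation computed on `G_collaboration` above.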

Let us plot the results.

In [None]:
plt.loglog(G_collaboration.es['similarity_jaccard'], 
           G_collaboration.es['n_joint_papers'], 
           '.', alpha=0.05);
plt.xlabel('Jaccard Similarity')
plt.ylabel('Number of joint papers');

As is often the case, the second part of the theory on weak links is much more difficult to study. It requires us to collect data on information that is spreading across co-authors. We will therefore not go into this here.

## Community detection

A common phenomenon in many networks is the presence of group structure, where nodes within the same group are densely connected. Such a structure is sometimes called a *modular* structure, and a frequently used measure of group structure is known as *modularity*. We will briefly look at group structure in both collaboration and citation.
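As an illustration of what modularity measures, the plain-Python sketch below evaluates it for a hypothetical, unweighted toy network. For each group it takes the fraction of links that fall inside the group and subtracts the fraction you would expect if links were placed at random while keeping the degrees of the nodes. This is the standard unweighted definition; the weighted version used later replaces link counts by sums of weights.

```python
from collections import Counter

# Hypothetical toy network: two triangles joined by the bridge 2-3.
toy_edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]

def modularity(edges, membership):
    m = len(edges)
    internal = Counter()     # links inside each group
    degree_sum = Counter()   # total degree of each group
    for u, v in edges:
        degree_sum[membership[u]] += 1
        degree_sum[membership[v]] += 1
        if membership[u] == membership[v]:
            internal[membership[u]] += 1
    return sum(internal[c] / m - (degree_sum[c] / (2 * m)) ** 2
               for c in degree_sum)

toy_two_groups = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
toy_one_group = {node: "A" for node in range(6)}

print(modularity(toy_edges, toy_two_groups))   # about 0.357
print(modularity(toy_edges, toy_one_group))    # 0.0
```

Splitting the toy network into its two triangles yields a higher modularity than putting all nodes in a single group, which always gives 0.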

First, we will import a package called `louvain`, which we will use for community detection and which is a bit more general than the methods implemented in `igraph`. It is built on top of `igraph` so that it easily integrates with all the existing methods of `igraph`.

In [None]:
import louvain

Now let us detect communities in the collaboration network, using the `n_joint_papers_frac` as the weight of the edges. Because the algorithm is stochastic, it may yield somewhat different results every time you run it. To prevent that from happening, and to always get the same result, we will set the random seed to 0. The result is a `VertexClustering`, which we already briefly encountered when studying homophily.

In [None]:
louvain.set_rng_seed(0)
communities = louvain.find_partition(G_collaboration, louvain.ModularityVertexPartition,
                                     weights='n_joint_papers_frac')

The length of the `communities` variable indicates the number of communities.

In [None]:
len(communities)

When accessing the `communities` variable as a list, each element corresponds to the set of nodes in that community.

In [None]:
communities[50]

Hence, nodes `318` and `6374` belong to community `50`. Another way to look at the communities is by looking at the `membership` of the `VertexClustering`.

In [None]:
communities.membership[:10]

Hence, node `0` belongs to community `6`, node `1` belongs to community `0`, node `2` belongs to community `15`, et cetera.

Let us take a closer look at the largest community.

In [None]:
H = communities.giant()
print(H.summary())

We could again detect communities using modularity in the largest community.

In [None]:
subcommunities = louvain.find_partition(H, louvain.ModularityVertexPartition, weights='n_joint_papers_frac')
ig.plot(subcommunities, vertex_size=5, vertex_frame_width=0, edge_width=np.sqrt(H.es['n_joint_papers_frac'])/5)

In general, modularity will continue to find subcommunities in this way. An alternative approach, called CPM, does not suffer from that problem. 

Let us detect communities using CPM. We do have to specify a parameter, called the `resolution_parameter`. As its name suggests, it specifies the resolution of the communities we would like to find. At a higher resolution we will tend to find smaller communities, while at a lower resolution we find larger communities. Let us use the resolution parameter 0.01.

In [None]:
louvain.set_rng_seed(0)
communities = louvain.find_partition(G_collaboration, louvain.CPMVertexPartition,
                                     weights='n_joint_papers_frac',
                                     resolution_parameter=0.01)
communities.giant().vcount()

<div class="alert alert-info">
Detect subcommunities in the largest community using CPM. How many subcommunities do you find? How does that compare to modularity?
</div>

<div class="alert alert-info">
Try to find more subcommunities by specifying a higher `resolution_parameter`.
</div>

Modularity adapts itself to the network. In a sense that is convenient, because you then do not have to specify any parameters. On the other hand, it makes the definition of what a "community" is less clear.

CPM does not adapt itself to the network, and maintains the same definition across different networks. That is convenient, because it brings more clarity to what we mean by a "community". Whenever you try to find subcommunities using the same `resolution_parameter`, CPM should (generally) not find any subcommunities.
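
To make the role of the `resolution_parameter` more concrete, the sketch below scores a partition the way CPM is commonly written: each community contributes its number of internal links minus the resolution times its number of node pairs. This is a simplified, unweighted illustration on a made-up network. The function `cpm_quality` and the toy data are assumptions for this example and are not part of the `louvain` package.

```python
def cpm_quality(edges, membership, resolution):
    """Sum over communities of (internal links - resolution * number of node pairs)."""
    quality = 0.0
    for community in set(membership):
        n_nodes = sum(1 for c in membership if c == community)
        internal = sum(1 for u, v in edges
                       if membership[u] == community and membership[v] == community)
        n_pairs = n_nodes * (n_nodes - 1) / 2
        quality += internal - resolution * n_pairs
    return quality

# Toy network: two triangles (0-1-2 and 3-4-5) joined by the single link 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
merged = [0, 0, 0, 0, 0, 0]   # everything in one community
split = [0, 0, 0, 1, 1, 1]    # one community per triangle

for resolution in (0.05, 0.5):
    print(resolution,
          cpm_quality(edges, merged, resolution),
          cpm_quality(edges, split, resolution))
```

At the low resolution the merged partition scores higher, while at the high resolution the split into two triangles scores higher. This is in line with the rule of thumb that a higher resolution yields smaller communities.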

Let us look at the disciplinary background of the ten largest communities.

In [None]:
membership_df = pd.DataFrame({'membership': communities.membership,
                              'discipline': G_collaboration.vs['discipline']})
membership_df.groupby(['membership', 'discipline']).size().unstack('discipline')[:10].fillna(0)

It seems there are multiple communities that mostly involve Clinical Medicine, and at least one that mostly involves Physics and Materials Science.

We can summarize the overall similarity between the detected communities and the partition based on the disciplines using the Normalised Mutual Information (NMI). The NMI varies between 0 and 1, and equals 1 if the two partitions are identical.

In [None]:
communities.compare_to(disciplines, method='nmi')
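
To get a feeling for what the NMI measures, here is a minimal sketch in plain Python on made-up labels. It assumes the normalisation that divides twice the mutual information by the sum of the two entropies; we believe this matches what `igraph` computes, but treat that as an assumption. The helper functions `entropy` and `nmi` are written for this illustration only.

```python
from collections import Counter
from math import log

def entropy(labels):
    n = len(labels)
    return -sum((count / n) * log(count / n) for count in Counter(labels).values())

def nmi(labels_a, labels_b):
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    count_ab = Counter(zip(labels_a, labels_b))
    # Mutual information between the two labelings.
    mutual_info = sum((n_ab / n) * log(n_ab * n / (count_a[a] * count_b[b]))
                      for (a, b), n_ab in count_ab.items())
    h_a, h_b = entropy(labels_a), entropy(labels_b)
    if h_a + h_b == 0:
        return 1.0  # both partitions consist of a single group
    return 2 * mutual_info / (h_a + h_b)

communities_toy = [0, 0, 0, 1, 1, 1]
same_groups_other_labels = [5, 5, 5, 9, 9, 9]   # identical grouping, different names
unrelated = [0, 1, 2, 0, 1, 2]                  # cuts across both communities evenly

print(nmi(communities_toy, same_groups_other_labels))
print(nmi(communities_toy, unrelated))
```

Note that only the grouping matters, not the labels themselves: relabelling the communities leaves the NMI unchanged.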

Now let us use the detected communities for further analysis.

<div class="alert alert-info">
Do citations tend to be given within collaboration communities?
</div>

<div class="alert alert-info">
Vary the `resolution_parameter` when detecting communities using the CPM method. What `resolution_parameter` seems reasonable to you, and why?
</div>

<div class="alert alert-info">
Detect communities in the citation network. Are the detected communities similar to those in the collaboration network or not?
</div>

## Structural holes

As discussed during class, people who fill structural holes occupy a particular position in the network. They are able to broker information between people who would otherwise not be connected. We can quantify this based on the clustering coefficient.
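
As a reminder, the local clustering coefficient of a node is the fraction of pairs of its neighbors that are themselves connected. The sketch below computes it in plain Python for a made-up network of four nodes. The function `local_clustering` is written for this illustration; nodes with fewer than two neighbors get the value 0, which is the same convention as `mode='zero'` in `igraph`.

```python
from itertools import combinations

def local_clustering(neighbors, v):
    """Fraction of pairs of neighbors of v that are linked to each other."""
    nbrs = neighbors[v]
    if len(nbrs) < 2:
        return 0.0  # no pairs of neighbors to check
    pairs = list(combinations(sorted(nbrs), 2))
    linked = sum(1 for a, b in pairs if b in neighbors[a])
    return linked / len(pairs)

# Toy network: a triangle 0-1-2, plus node 3 that is only linked to node 0.
neighbors = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}
for v in sorted(neighbors):
    print(v, local_clustering(neighbors, v))
```

Node `0` connects node `3` to the triangle, so only one of its three pairs of neighbors is linked. A low clustering coefficient is what we expect for a node that bridges otherwise unconnected people.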

Typically, nodes with a higher degree will show a lower clustering coefficient. This is partly because a larger group of neighbors is simply less likely to all collaborate with each other. If there are only 4 or 5 co-authors, it is not unlikely that they will all collaborate with each other. However, if you get to 100 co-authors, it is rather unlikely that all of them collaborate.

In [None]:
plt.plot(G_collaboration.degree(), 
         G_collaboration.transitivity_local_undirected(mode='zero'), 
         '.', alpha=0.1)
plt.xscale('log')
plt.xlabel('Degree')
plt.ylabel('Clustering coefficient');

Now let us see if authors that tend to connect different people are more frequently cited on average.

First let us calculate the average number of citations.

In [None]:
G_citation.vs['total_cits'] = G_citation.strength(weights='n_cits')
G_citation.vs['avg_cits'] = np.array(G_citation.vs['total_cits'])/np.array(G_citation.vs['n_pubs'])

<div class="alert alert-info">
Calculate the correlation between the average number of citations and the clustering coefficient.
Also plot the two against each other. What do you conclude?
</div>

## Citations and co-authorship

Let us delve a bit further into the relationship between citations and co-authorship.

First, let us see how the citation degree compares to the collaboration degree. For the citation degree, there are two variants: the in-degree and the out-degree. For example, the in-degree counts the number of authors that cite 'P. Carmeliet', while the out-degree counts the number of authors cited by 'P. Carmeliet'.
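
To make the two variants concrete, the sketch below counts them by hand for a handful of made-up citation links, each written as a (citing author, cited author) pair. The author names and links are hypothetical.

```python
from collections import Counter

# Hypothetical citation links as (citing author, cited author) pairs.
citations = [('A', 'B'), ('A', 'C'), ('B', 'C'), ('D', 'C'), ('C', 'A')]

# Out-degree: how many authors does each author cite?
out_degree = Counter(citing for citing, cited in citations)
# In-degree: by how many authors is each author cited?
in_degree = Counter(cited for citing, cited in citations)

for author in 'ABCD':
    print(author, 'in:', in_degree[author], 'out:', out_degree[author])
```

In the notebook itself, `igraph` does this counting for us through `degree(mode='in')` and `degree(mode='out')`.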

In [None]:
f = plt.figure(figsize=(18,6))
ax = f.add_subplot(1, 2, 1)
ax.plot(G_collaboration.degree(), G_citation.degree(mode='in'), '.')
ax.set_xlabel('Collaboration degree')
ax.set_ylabel('Citation in-degree')

ax2 = f.add_subplot(1, 2, 2)
ax2.plot(G_collaboration.degree(), G_citation.degree(mode='out'), '.')
ax2.set_xlabel('Collaboration degree')
ax2.set_ylabel('Citation out-degree');

The citation out-degree in particular correlates well with the collaboration degree. The number of authors that are cited by an author is roughly proportional to the number of collaborators of that author. That raises the question: do people tend to cite their co-authors? Going one step further: do people tend to cite co-authors of their co-authors?

Let us try to analyse that a bit further. We only examine this in the largest community, because the analysis would otherwise require more computational resources than we have available.

In [None]:
comms = G_collaboration.community_multilevel()

max_comm, max_size = max(enumerate(comms.sizes()), key=lambda x: x[1])

# Define the collaboration and citation network of the largest community
H_collaboration = G_collaboration.induced_subgraph(comms[max_comm])
H_citation = G_citation.induced_subgraph(comms[max_comm])
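
The expression `max(enumerate(...), key=lambda x: x[1])` used above may look cryptic. It pairs every size with its index and returns the pair with the largest size. A minimal sketch on made-up community sizes:

```python
sizes = [12, 40, 7, 40, 3]  # hypothetical community sizes

# enumerate yields (index, size) pairs; the key tells max to compare the sizes.
largest_index, largest_size = max(enumerate(sizes), key=lambda x: x[1])
print(largest_index, largest_size)
```

If several communities share the largest size, `max` returns the first one it encounters.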

We first ask `igraph` to calculate the shortest path distance between *all* pairs of nodes. We will store this as a two-dimensional `numpy` array.

In [None]:
dist = np.array(H_collaboration.shortest_paths())
dist
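
Entry `[i, j]` of such a distance matrix is the number of links on a shortest path from node `i` to node `j`. The sketch below builds the same kind of matrix by hand for a made-up network, using a breadth-first search in plain Python. The function `bfs_distances` is written for this illustration. Pairs that cannot reach each other are given an infinite distance here, which we assume mirrors how `igraph` reports them.

```python
from collections import deque

def bfs_distances(neighbors, source):
    """Number of links on a shortest path from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in neighbors[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

# Toy network: a path 0 - 1 - 2 - 3, plus an isolated node 4.
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2], 4: []}
nodes = sorted(neighbors)

dist_matrix = []
for u in nodes:
    from_u = bfs_distances(neighbors, u)
    dist_matrix.append([from_u.get(v, float('inf')) for v in nodes])

for row in dist_matrix:
    print(row)
```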

Now, let us look at the number of citations for all pairs of nodes. We also store this as a two-dimensional `numpy` array.

In [None]:
n_cits = np.array(H_citation.get_adjacency('n_cits').data)
n_cits

<div class="alert alert-info">
What do you immediately observe about self-citations?
</div>

We create a dataframe for both variables: each row will correspond to a pair of nodes.

In [None]:
n_cits_per_dist_df = pd.DataFrame({'n_cits': np.reshape(n_cits, -1),
                                   'dist': np.reshape(dist, -1)})

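We plot the total number of citations at a certain distance, together with the expected number of citations at that distance. The expected number is what we would observe if all citations were spread evenly over all pairs of nodes, regardless of their distance.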

In [None]:
total_cit_per_dist = n_cits_per_dist_df.groupby('dist').sum()
dist_hist = n_cits_per_dist_df.groupby('dist').size()

plt.plot(total_cit_per_dist.index, total_cit_per_dist, marker='o', label='Total cits.')
plt.plot(dist_hist.index, (dist_hist/dist_hist.sum())*n_cits.sum(), marker='*', label='Total exp. cits.')
plt.yscale('log')
plt.legend(loc='best');
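
The computation behind this plot can be followed by hand on a very small example. The sketch below uses made-up numbers for three authors and plain Python instead of `pandas`. The expected number of citations at a distance is the share of node pairs at that distance multiplied by the total number of citations.

```python
from collections import defaultdict

# Hypothetical example: dist[i][j] is the collaboration distance between
# authors i and j, n_cits[i][j] the number of citations from i to j.
dist = [[0, 1, 2],
        [1, 0, 1],
        [2, 1, 0]]
n_cits = [[4, 6, 0],
          [3, 2, 1],
          [0, 4, 0]]

total_per_dist = defaultdict(int)   # observed citations at each distance
pairs_per_dist = defaultdict(int)   # number of (ordered) pairs at each distance
for dist_row, cits_row in zip(dist, n_cits):
    for d, c in zip(dist_row, cits_row):
        total_per_dist[d] += c
        pairs_per_dist[d] += 1

n_pairs = sum(pairs_per_dist.values())
total_cits = sum(total_per_dist.values())
expected_per_dist = {d: pairs_per_dist[d] / n_pairs * total_cits
                     for d in pairs_per_dist}

for d in sorted(total_per_dist):
    print(d, total_per_dist[d], round(expected_per_dist[d], 2))
```

Distances where the observed total lies above the expected number receive more citations than their share of pairs alone would suggest.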

<div class="alert alert-info">
What do you conclude from this analysis?
</div>

## Social influence

Let us briefly examine a simple model of social influence. We assume that every author holds some opinion, but switches to the majority opinion in their neighborhood whenever they reconsider it. Let us start with some random initial opinions.

In [None]:
G_collaboration.vs['opinion'] = np.random.choice(50, G_collaboration.vcount())

<div class="alert alert-info">
Given a node `v`, how do we get a list of all opinions of the neighbors?
</div>

In [None]:
v = G_collaboration.vs[0]
opinions = ...

You can get the majority opinion as follows:

In [None]:
majority_opinion = np.argmax(np.bincount(opinions))
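
Here `np.bincount` counts how often each opinion occurs, and `np.argmax` returns the opinion with the highest count. In case of a tie, `np.argmax` returns the first maximum, which here is the smallest opinion number. The sketch below reproduces that behaviour in plain Python on made-up opinions; the function `majority` is written for this illustration only.

```python
from collections import Counter

def majority(opinions):
    """Most frequent opinion; ties are broken in favour of the smallest opinion."""
    counts = Counter(opinions)
    # Sorting first makes max return the smallest opinion among tied ones.
    return max(sorted(counts), key=counts.get)

print(majority([3, 7, 7, 1, 3, 7]))   # hypothetical opinions of six neighbors
print(majority([5, 2, 5, 2]))         # a tie between opinions 2 and 5
```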

We are now going to integrate everything in a function that loops over all nodes, and assigns each node the majority opinion among its neighbors.

<div class="alert alert-info">
Use the answers to the previous questions to fill in the blanks.
</div>

In [None]:
def influence_majority_opinion(G):
    for v in G.vs:
        if (v.degree() > 0):
            opinions = ...
            majority_opinion = ...
            v['new_opinion'] = majority_opinion
        else:
            v['new_opinion'] = v['opinion']
    G.vs['opinion'] = G.vs['new_opinion']    

We can now update the opinion of every node by simply running the function `influence_majority_opinion`.

In [None]:
influence_majority_opinion(G_collaboration)

<div class="alert alert-info">
What is the assortativity of the opinion? Run the majority opinion influence a couple of times. What is the assortativity then? Would you conclude there is some opinion homophily in the collaboration network?
</div>

Instead of a type of social influence where people change their minds, researchers are often interested in the diffusion of ideas or innovations. Let us create a simple function that again does one step of social influence. Let us simply assume that if anybody in the neighborhood of a node has adopted, that node will adopt with probability $p$.

This is how it works for a single node `v`.

In [None]:
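# Note: the 'adopted' vertex attribute must exist before this code can run
# (it is initialised in a cell further below).
v = G_collaboration.vs[0]
p = 0.5
# If at least one neighbor has adopted, v adopts with probability p.
if any(G_collaboration.vs[G_collaboration.neighbors(v)]['adopted']):
    if (np.random.rand() < p):
        v['new_adopted'] = True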

<div class="alert alert-info">
Now create a function similar to `influence_majority_opinion` that updates the `adopted` vertex attribute for all nodes.
</div>

In [None]:
def adopt(G, p):
    ...

Now let us initially set one node to adopt some idea or innovation.

In [None]:
G_collaboration.vs['adopted'] = False
G_collaboration.vs[0]['adopted'] = True

Simply doing one round of adoption then leads to some diffusion.

In [None]:
adopt(G_collaboration, 0.5)
np.bincount(G_collaboration.vs['adopted'])

<div class="alert alert-info">
<p>Plot the total number of authors that have adopted after rounds 1 to 20.</p>

<p>You can construct an empty list by saying `n_adopted = []` and you can append an element to the list using the `append` function. Finally, by doing something like `for time in range(20):`, Python will execute the subsequent indented code  with `time=0`, `time=1`, ..., until `time=19`.</p>
</div>
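
The pattern of collecting one value per round can be sketched as follows. The example does not use the network at all: it simply counts heads in repeated coin flips, and the name `n_heads` is made up for this illustration.

```python
import random

random.seed(0)        # fixed seed so that the toy run is repeatable
n_heads = []          # will hold one entry per round
heads_so_far = 0
for time in range(20):
    if random.random() < 0.5:
        heads_so_far += 1
    # Store the running total after this round.
    n_heads.append(heads_so_far)

print(len(n_heads), n_heads[-1])
```

A list built in this way can be passed directly to `plt.plot`, just like the lists used in the earlier plots.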

<div class="alert alert-info">
By setting `v['adopted'] = True` for a single node `v` initially, the diffusion of course depends on which node `v` you choose. Try to find nodes for which the diffusion is fast. Similarly, try to find nodes for which the diffusion is slower. Which of the measures we encountered earlier are good indicators of fast or slow diffusion? How much does the selection of the initially adopted node matter?
</div>

# Further analysis

You now have a basic understanding of the possibilities for network analysis using `igraph`. Now try to do some analysis on your own. Below are some suggestions for further analysis, but feel free to pursue your own ideas.

<div class="alert alert-success">
You can introduce new code cells via the `Insert` menu, and remove them via the `Edit` menu. If you want to save anything you did, please make sure you download the notebook via the `File` menu.
</div>

<div class="alert alert-info">
Are highly cited researchers more central in the collaboration network?
</div>

<div class="alert alert-info">
Do researchers tend to collaborate with authors that their collaborators frequently cite?
</div>

<div class="alert alert-info">
Some researchers show more interdisciplinary collaborations. Do such researchers tend to collaborate more with other interdisciplinary collaborating authors?
</div>

<div class="alert alert-info">
If you remove collaboration links, the network will start to fall apart. What links are more important to maintain connectivity: weaker links or stronger links?
</div>

<div class="alert alert-info">
Some researchers may not collaborate, even though they share many collaborators. Identify such pairs of researchers, and analyse whether they also do not cite each other.
</div>