# Introduction

In this lab exercise, you will learn how to analyse networks in Python. We first cover some practicalities of loading data and creating a network in Python. We then start with some basic analysis and move gradually to more complex analyses.

We will be analysing an author network. We have prepared two types of networks: a collaboration network and a citation network.

This Python notebook is intended to be used as an exercise. We have prepared everything for you, but in some places we will ask you to fill in some of the blanks. If you need any help with anything, please don't hesitate to ask your teachers.

## Reading data

We first start with loading the network data. We prepared three `tab`-separated text files for your convenience. The file `authors.txt` contains all the author information; the other two contain the data on the collaboration links (`collaboration.txt`) and the citation links (`citation.txt`).

In order to read in this data, we first need to make sure that Python is able to read it. A very versatile *package* for handling data in Python is called `pandas`. For those of you familiar with `R`, it is similar to the `data.frame` in `R`.

We *import* this package as follows:

<div class="alert alert-success">
In order to execute the code you have to press Ctrl-Enter while selecting the code cell below. Alternatively, you can press the "Play" button at the top of the screen. This also moves to the next cell at the same time. Using Shift-Enter instead of Ctrl-Enter will also execute the code and move to the next cell at the same time.
</div>

In [1]:
import pandas as pd

<div class="alert alert-success">
If you have executed that code cell correctly, it should now be numbered `1`. While the code in that cell is executed it is marked by an asterisk `*`. Each cell of executed code will be numbered in the order in which you execute it. If you execute it again, it will be numbered `2`, et cetera.
</div>

We are now ready to read in data. We have named the `pandas` package `pd`, which will save us some typing. We are going to read the tab-separated files using the general interface for reading comma-separated value files, normally known as `csv` files.

In [2]:
authors_df = pd.read_csv('authors.txt', sep='\t')

We called the *function* `read_csv` of the `pandas` package (which we refer to as `pd`). We needed to provide it with two *arguments*. The first argument is the name of the file we want to read. The second argument indicates that the values in this file are not separated by a comma, but by a `tab`, which is represented as `\t`.
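To see what the `sep` argument does without needing a file on disk, here is a minimal sketch that reads a small tab-separated string. The sample rows and the name `sample_df` are made up for illustration and are not part of the lab data.

```python
import io

import pandas as pd

# Hypothetical two-row sample in a tab-separated layout (not the real authors.txt)
sample = "author_id\tfull_name\tn_pubs\n1\tdoe, j\t12\n2\tsmith, a\t7\n"

# io.StringIO lets read_csv treat the string as if it were a file
sample_df = pd.read_csv(io.StringIO(sample), sep='\t')

print(sample_df.shape)          # (number of rows, number of columns)
print(list(sample_df.columns))  # column names taken from the first line
```

If you leave out `sep='\t'`, `pandas` assumes a comma as separator and the columns are not split at the tabs.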

The `pandas` package took care of reading the file, and has now stored it in the variable called `authors_df`. You can now take a closer look at `authors_df` to see the data that we just read.

In [3]:
authors_df

Unnamed: 0,author_id,full_name,first_name,n_pubs,gender,discipline
0,1629293,"van acker, k",kristien,31,f,CLINICAL MEDICINE
1,1659248,"leyns, l",luc,31,m,BASIC LIFE SCIENCES
2,1683873,"thienpont, h",hugo,341,m,PHYSICS AND MATERIALS SCIENCE
3,1712696,"gobin, a",anne,37,f,ENVIRONMENTAL SCIENCES AND TECHNOLOGY
4,3221178,"fally, s",sophie,30,f,CHEMISTRY AND CHEMICAL ENGINEERING
5,3244878,"houf, k",kurt,120,m,BASIC LIFE SCIENCES
6,3250712,"crine, m",michel,95,m,CHEMISTRY AND CHEMICAL ENGINEERING
7,3258114,"kesteloot, c",chris,3,m,ENVIRONMENTAL SCIENCES AND TECHNOLOGY
8,3281578,"deplaen, e",etienne,75,m,BASIC LIFE SCIENCES
9,5008328,"van steen, a",andre,35,m,CLINICAL MEDICINE


After execution you will see quite a long table with multiple columns. The bottom of the table shows how many rows and how many columns it contains. The columns provide different types of information about the authors.

Besides the id and the names, it shows the total number of publications available from the Web of Science (WoS), the gender and the discipline in which the authors most frequently published.

We made a selection of all authors in Belgium who have first published before 2000 and have last published after 2010.

Let us now try to load the collaboration data. You should be able to do that yourself by filling in the blanks below. The file you want to load is called `collaboration.txt`.

In [4]:
collaboration_df = pd.read_csv('collaboration.txt', sep='\t')

Take a closer look at what is stored in `collaboration_df`.

In [5]:
collaboration_df

Unnamed: 0,author_idA,author_idB,n_joint_papers,n_joint_papers_frac
0,336996,690940,4,0.396465
1,2366,870892,1,0.125000
2,1105541,1229707,1,0.200000
3,228908,1250652,1,0.090909
4,587063,1510779,1,0.125000
5,957212,1510779,1,0.200000
6,1185134,1525771,3,0.235354
7,1150300,1578924,10,1.961905
8,1402267,1590098,1,0.166667
9,1496522,1619435,1,0.166667


How many links are there in the collaboration network?

## Creating a network

We now have everything we need to construct our collaboration network. In order to work with networks, we again have to import another package. This package is called `igraph`. Try to import it yourself, and call the package `ig`.

In [6]:
import igraph as ig

We now provide the data stored in `authors_df` and `collaboration_df` to the package `igraph` in order to construct a network. The terminology of the package is somewhat different, and a network is called a *graph*, nodes are called *vertices* and links are called *edges*. They all refer to the same respective concepts, only the terminology is different.

In [7]:
G_collaboration = ig.Graph.DictList(
      vertices=authors_df.to_dict('records'),
      edges=collaboration_df.to_dict('records'),
      vertex_name_attr='author_id',
      edge_foreign_keys=('author_idA', 'author_idB'),
      directed=False
      )

Calling the function `.to_dict('records')` changes the data format to something that `igraph` can work with. The `vertex_name_attr` denotes the column of `authors_df` that serves as the identifier of each node. The `edge_foreign_keys` indicate which columns of `collaboration_df` serve to identify the two endpoints of a collaboration link. Finally, by stating `directed=False` we indicate that we create an undirected graph.
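As a small illustration of what `.to_dict('records')` produces, the sketch below converts two miniature tables into lists of dictionaries, one dictionary per row. The tables `mini_authors` and `mini_links` are hypothetical stand-ins for `authors_df` and `collaboration_df`.

```python
import pandas as pd

# Hypothetical miniature versions of authors_df and collaboration_df
mini_authors = pd.DataFrame({'author_id': [1, 2, 3],
                             'full_name': ['doe, j', 'smith, a', 'lee, k']})
mini_links = pd.DataFrame({'author_idA': [1, 2],
                           'author_idB': [2, 3],
                           'n_joint_papers': [4, 1]})

# Each row becomes one dictionary that maps column names to values
vertex_records = mini_authors.to_dict('records')
edge_records = mini_links.to_dict('records')

print(vertex_records[0])
print(edge_records[0])
```

Each vertex dictionary carries the identifier column (`author_id`), and each edge dictionary carries the two columns that point to those identifiers.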

We now have created our collaboration network! Before we go into the analysis of the network, let us first look at a brief summary of the network.

In [8]:
G_collaboration.summary()

'IGRAPH U--- 7384 90036 -- \n+ attr: author_id (v), discipline (v), first_name (v), full_name (v), gender (v), n_pubs (v), author_idA (e), author_idB (e), n_joint_papers (e), n_joint_papers_frac (e)'

The first part indicates that we have an undirected graph (`U`) with 7384 nodes and 90036 links. Does that match up with the number of rows from `authors_df` and `collaboration_df`?

After that, you will see the columns of `authors_df` that are now available as vertex attributes (indicated by the `v` behind the name of the attribute), and the columns of `collaboration_df` that are now available as edge attributes (indicated by the `e`).

# Network analysis

Now that we have loaded everything, we can start taking a closer look at the network and try to analyse it.

## Connectivity

Let us start with a very simple question. Is it connected?

In [None]:
G_collaboration.is_connected()

Apparently, not all authors in Belgium are connected via co-authored papers. How many authors do you think will be connected to each other? 

In order to take a closer look, we need to detect the connected components. This is easily done, but the function is confusingly called `clusters`.

In [None]:
components = G_collaboration.clusters()

We only want the so-called giant component. 
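To make the idea of connected components concrete, here is a minimal plain-Python sketch on a toy network. It is only an illustration of the concept, not how `igraph` implements it, and the names `connected_components`, `toy`, `toy_components` and `toy_giant` are made up for this example.

```python
from collections import deque


def connected_components(adjacency):
    """Return a list of components, each given as a sorted list of nodes."""
    seen = set()
    found = []
    for start in adjacency:
        if start in seen:
            continue
        # Breadth-first search collects everything reachable from `start`
        seen.add(start)
        queue = deque([start])
        component = []
        while queue:
            node = queue.popleft()
            component.append(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        found.append(sorted(component))
    return found


# Hypothetical toy network: nodes 0-3 are linked, nodes 4-5 form a separate pair
toy = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2], 4: [5], 5: [4]}
toy_components = connected_components(toy)
toy_giant = max(toy_components, key=len)  # the largest component

print(toy_components)
print(toy_giant)
```

The giant component is simply the component with the most nodes.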

<div class="alert alert-success">
Now it is time to introduce you to a little trick: you can get a list of all functions of the variable `components` by simply typing `components.` and then pressing `Tab` (make sure the cursor is located after the `.`). If you then start typing the name of the function you are looking for and press `Tab` again, Python will automatically complete it as far as possible.
</div>

In [None]:
components.giant()

What function do you think returns the giant component?

<div class="alert alert-success">
Time for one other trick: go back to the previous code cell, and select the function you think returns the giant component. Put the cursor anywhere on the function and press `Shift-Tab`. You now get a little bit more documentation of what this function does. You can press the `+` to find out more.
</div>

Now let us only look at the giant component.

In [None]:
H = components.giant()

Let us check how many nodes are in the giant component. Remember, we can again call the function `.summary`.

In [None]:
H.summary()

The giant component includes over 94% of the nodes and more than 99.9% of the links of the complete graph.

Let us take a closer look at how far authors in Belgium are apart from one another. Let us simply take a look at node number 0 (the first node has number 0, not 1) and node number 1. 

In [9]:
path = G_collaboration.get_shortest_paths(0, 1)
path

[[0, 2531, 6863, 1]]

This returns the path of the nodes between node number 0 and node number 1. How many nodes are in the path? What is the path length?
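The difference between the number of nodes on a path and the path length can be seen in a minimal plain-Python sketch of a breadth-first search on a toy network. The function `bfs_shortest_path` and the network `toy` are hypothetical and only serve as illustration.

```python
from collections import deque


def bfs_shortest_path(adjacency, source, target):
    """Return one shortest path from source to target as a list of nodes."""
    previous = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            break
        for neighbour in adjacency[node]:
            if neighbour not in previous:
                previous[neighbour] = node
                queue.append(neighbour)
    if target not in previous:
        return None  # source and target are in different components
    # Walk back from the target to the source
    path = []
    node = target
    while node is not None:
        path.append(node)
        node = previous[node]
    return path[::-1]


# Hypothetical toy network: a chain 0 - 2 - 3 - 1
toy = {0: [2], 1: [3], 2: [0, 3], 3: [1, 2]}
toy_path = bfs_shortest_path(toy, 0, 1)

print(toy_path)           # the nodes on the path
print(len(toy_path) - 1)  # the path length is the number of links
```

A path with four nodes has three links, so its length is 3.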

Perhaps these numbers do not mean that much to you. You can find out more about an individual node by looking at the `VertexSequence` of `igraph`, abbreviated as `vs`. This is a sort of list of all vertices, and is indexed by brackets `[ ]` instead of using parentheses `( )` as we do with functions.

In [10]:
G_collaboration.vs[0]

igraph.Vertex(<igraph.Graph object at 0x0000000007A7B4F8>, 0, {'author_id': 1629293, 'full_name': 'van acker, k', 'first_name': 'kristien', 'n_pubs': 31, 'gender': 'f', 'discipline': 'CLINICAL MEDICINE'})

The vertex itself can also be indexed with brackets, using the name of an attribute, so you can return only the name as follows:

In [11]:
G_collaboration.vs[0]['full_name']

'van acker, k'

In fact, you can also list multiple vertices at once.

In [12]:
G_collaboration.vs[[0, 2384, 6486, 1]]['full_name']

['van acker, k', 'janssen, f', 'carlon, e', 'leyns, l']

This shows that Van Acker collaborated with Janssen, who collaborated with Carlon, who in the end collaborated with Leyns.

We can let `igraph` also calculate how far apart all nodes are.

<div class="alert alert-warning">
The following may take some time to run
</div>

In [13]:
path_lengths = G_collaboration.path_length_hist()
print(path_lengths)

N = 24314899, mean +- sd: 3.7926 +- 1.0931
Each * represents 148751 items
[ 1,  2):  (90036)
[ 2,  3): *********** (1782624)
[ 3,  4): ********************************************************* (8593271)
[ 4,  5): ********************************************************** (8627590)
[ 5,  6): ************************ (3702347)
[ 6,  7): ******* (1072567)
[ 7,  8): ** (307843)
[ 8,  9):  (94635)
[ 9, 10):  (31038)
[10, 11):  (10430)
[11, 12):  (1962)
[12, 13):  (420)
[13, 14):  (105)
[14, 15):  (21)
[15, 16):  (10)


How far apart are most authors? Do you think most authors are close by? Or do you think they are far apart?

Let us briefly take a closer look at the path between node 0 and node 1 again. Now, instead of the nodes on the path, we want to take a closer look at the edges on the path.

In [None]:
epath = G_collaboration.get_shortest_paths(0, 1, output='epath')
epath

There are three edges on this path, but the numbers themselves are not very informative. They refer to the edges, and similar to the `VertexSequence` we encountered earlier, there is also an `EdgeSequence`, abbreviated as `es`. Let us take a closer look at the number of joint papers that the authors on this path have co-authored.

In [None]:
G_collaboration.es[epath[0]]['n_joint_papers']

Perhaps there are other paths that connect the two authors with more joint papers? Perhaps we could use the number of joint papers as weights?

In [None]:
epath = G_collaboration.get_shortest_paths(0, 1, weights='n_joint_papers', output='epath')
epath

We do get a path that is longer. Let us take a look at the number of joint papers.

In [None]:
G_collaboration.es[epath[0]]['n_joint_papers']

The number of joint papers is lower! That is because 'shortest path' means: the path with the lowest sum of the weights. This is clearly not what we want. You should always be aware of this whenever using the concept of the 'shortest path'.

<div class="alert alert-danger">
<b>Attention!</b> Weighted shortest paths have the *lowest* total weight.
</div>
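The following minimal sketch shows this effect on a toy network, using Dijkstra's algorithm in plain Python. The weights play the role of `n_joint_papers`; the network `toy` and the function `lowest_weight_path` are made up for illustration.

```python
import heapq


def lowest_weight_path(weighted_adjacency, source, target):
    """Dijkstra's algorithm: return (total weight, path) with the lowest total weight."""
    queue = [(0, source, [source])]
    done = set()
    while queue:
        total, node, path = heapq.heappop(queue)
        if node == target:
            return total, path
        if node in done:
            continue
        done.add(node)
        for neighbour, weight in weighted_adjacency[node].items():
            if neighbour not in done:
                heapq.heappush(queue, (total + weight, neighbour, path + [neighbour]))
    return None  # target cannot be reached from source


# Hypothetical toy network: a-b-d are strong links (10 joint papers each),
# a-c-e-d are weak links (1 joint paper each)
toy = {
    'a': {'b': 10, 'c': 1},
    'b': {'a': 10, 'd': 10},
    'c': {'a': 1, 'e': 1},
    'd': {'b': 10, 'e': 1},
    'e': {'c': 1, 'd': 1},
}
toy_total, toy_weighted_path = lowest_weight_path(toy, 'a', 'd')

print(toy_total)
print(toy_weighted_path)
```

The weighted 'shortest' path takes three weak links (total weight 3) rather than the two strong links (total weight 20), even though the strong route has fewer steps.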

## Clustering

We already saw that most pairs of authors are only 3 or 4 links apart. As we encountered in the lecture, such short distances are one of the signs of a small-world network. The other one is clustering: whether co-authors of an author also tend to be co-authors among themselves.

Let us take a look at the co-authors of author number 0, which are called the *neighbors* in network terminology.

In [None]:
G_collaboration.neighborhood(0)

Now, what we actually want to know is whether many of those neighbors are connected. That is, we want to take the subgraph of all authors that have co-authored with author number 0.

In [None]:
H = G_collaboration.induced_subgraph(G_collaboration.neighborhood(0))
H.summary()

This subgraph only has 13 nodes (including node 0, so it has 12 neighbours) and 35 edges. This is sufficiently small to be easily plotted for visual inspection.

In [None]:
H.vs['color'] = 'red'
H.vs[0]['color'] = 'grey'
ig.plot(H)

Do many of the co-authors collaborate among themselves as well? Why do you think this is the case?

We can also ask `igraph` to calculate the clustering coefficient of node 0. In `igraph` this is called *transitivity*, which is the same concept under a different name.

In [None]:
G_collaboration.transitivity_local_undirected(0)

This shows that about 35% of the pairs of co-authors of node 0 have also written papers with each other.
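The local clustering coefficient is the fraction of pairs of neighbours that are themselves linked. Below is a minimal plain-Python sketch on a toy network, followed by the same arithmetic for node 0 using the numbers reported above (12 neighbours and 35 edges in the subgraph, 12 of which connect node 0 to its neighbours). The function `local_clustering` and the network `toy` are hypothetical.

```python
from itertools import combinations


def local_clustering(adjacency, node):
    """Fraction of pairs of neighbours of `node` that are linked to each other."""
    neighbours = adjacency[node]
    possible = len(neighbours) * (len(neighbours) - 1) // 2
    if possible == 0:
        return 0.0  # fewer than two neighbours: no pairs to check
    linked = sum(1 for u, v in combinations(neighbours, 2) if v in adjacency[u])
    return linked / possible


# Hypothetical toy network: node 0 has neighbours 1, 2 and 3; only 1 and 2 are linked
toy = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}
print(local_clustering(toy, 0))  # 1 linked pair out of 3 possible pairs

# The same arithmetic for node 0 of the collaboration network
links_among_neighbours = 35 - 12   # edges in the subgraph that do not involve node 0
possible_pairs = 12 * 11 // 2      # number of pairs among 12 neighbours
print(links_among_neighbours / possible_pairs)
```

For node 0 this gives 23 linked pairs out of 66 possible pairs, which is about 0.35.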

Let us now calculate this for all nodes.

In [None]:
G_collaboration.transitivity_avglocal_undirected()

On average, about 38% of the pairs of co-authors of an author have also written papers with each other.

We can conclude that the graph shows signs of a small-world network.

## Centrality

Often, people want to identify which nodes seem to be most important in some way in the network. This is often thought of as a type of *centrality* of a node.

### Degree

The simplest type of centrality is the *degree* of a node, which is simply the number of its neighbours. Previously, we saw that node 0 had 12 neighbours, so we say its degree is 12.

In [None]:
G_collaboration.degree(0)

We can also simply calculate that for everybody and store it in a new vertex attribute called `degree`.

In [None]:
G_collaboration.vs['degree'] = G_collaboration.degree()

We can plot this to look at the degree distribution. For that, we need to load the `matplotlib` package. We import the plotting functionality and name the package `plt`. We also include a statement telling Python to simply plot immediately in the Notebook.

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

Now let us plot a histogram of the degree.

In [None]:
plt.hist(G_collaboration.vs['degree'], 50);

This clearly shows that the degree distribution is quite skewed. Most authors have only a few collaborators, while a few authors have many collaborators. A network with such a skewed degree distribution is sometimes referred to as a "scale-free" network.

The code below sorts the nodes in descending order of the degree, and prints the full name and the degree of those with the highest degree. Explaining the code in full detail goes beyond this course, so simply take it for granted.

In [None]:
highest_degree = sorted(G_collaboration.vs, key=lambda v: v['degree'], reverse=True)
[(v['full_name'], v['degree']) for v in highest_degree[:10]]

So, apparently, P. Carmeliet has collaborated with 235 other authors from Belgium! Of course, this only considers the number of co-authors; it does not take into account the number of papers written with each of them.
When we take such *edge weights* (like the number of joint papers) into account, the weighted degree is referred to as the *strength* of a node (a term that can be a bit confusing).

Let us look at the strength of node 0.

In [None]:
G_collaboration.strength(0, weights='n_joint_papers')

Apparently, author 0 collaborated with 12 different authors, and has a total strength of 30. But what does this 30 mean? We need to think carefully about this. Suppose that author 0 has co-authored a single publication with three other co-authors. The link to each of the three co-authors would then have an edge weight of `n_joint_papers = 1`, so the strength would be 3. Hence, the strength denotes the total number of collaborations that an author had, which depends both on the number of publications and on the number of collaborators per paper.

Sometimes, it is undesirable that the total weight of the links created on the basis of a single publication is higher simply because of the total number of co-authorships. We can then fractionally count the weight of each collaboration between $n_a$ authors as

$$\frac{1}{n_a - 1}.$$

This is also available as the edge attribute `n_joint_papers_frac`.

In [None]:
G_collaboration.strength(0, weights='n_joint_papers_frac')

Normally, the strength calculated in this way would indicate simply the number of collaborative papers. Because not all co-authors are included in this network (only authors from Belgium are included), it does not neatly sum up in our case.
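The difference between full and fractional counting can be made concrete with a minimal plain-Python sketch. The list `papers` is a hypothetical set of two publications, and the names `n_joint`, `n_joint_frac` and `strength` are made up for this example; exact fractions are used to avoid rounding.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import combinations

# Hypothetical papers, each given as its list of authors
papers = [['a', 'b', 'c', 'd'], ['a', 'b']]

n_joint = defaultdict(int)            # full counting: 1 per joint paper
n_joint_frac = defaultdict(Fraction)  # fractional counting: 1 / (n_a - 1) per joint paper
for authors in papers:
    for pair in combinations(sorted(authors), 2):
        n_joint[pair] += 1
        n_joint_frac[pair] += Fraction(1, len(authors) - 1)


def strength(weights, author):
    """Sum the weights of all links that involve `author`."""
    return sum(weight for pair, weight in weights.items() if author in pair)


print(strength(n_joint, 'a'))       # 3 links from the first paper plus 1 from the second
print(strength(n_joint_frac, 'a'))  # 3 * 1/3 from the first paper plus 1 from the second
```

With full counting, author `a` has a strength of 4. With fractional counting the strength is 2, which is exactly the number of collaborative papers of `a`, because all co-authors are included in this toy example.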

In [None]:
import numpy as np
import statsmodels.api as sm
sm.OLS(np.log(np.array(G_collaboration.vs['n_pubs']) + 1), np.log(np.array(G_collaboration.vs['degree']) + 1)).fit().summary()

### Betweenness centrality

Betweenness centrality is much more elaborate, and gives an indication of the number of times a node is on the shortest path from one node to another node.

As you can imagine, this can take quite some time to calculate for all nodes. Let us here calculate it for node 0 again.

<div class="alert alert-warning">
The following may take some time to run
</div>

In [None]:
G_collaboration.betweenness(vertices=0)

As we did previously when dealing with shortest paths, we can also use a weight for determining the shortest paths.

<div class="alert alert-warning">
The following may take some time to run
</div>

In [None]:
G_collaboration.betweenness(vertices=0, weights='n_joint_papers')

As we already found out previously, doing so actually does not correspond with our intuition that shorter paths are better connected. The betweenness centrality defined in this way is therefore not very informative.

<div class="alert alert-danger">
<b>Attention!</b> Weighted shortest paths have the *lowest* total weight.
</div>

<div class="alert alert-warning">
Because betweenness centrality takes such a long time to run, we will not look at the most central authors according to betweenness centrality.
</div>

### Pagerank

One way of identifying central nodes relies on the idea of a random walk in a network. When performing such a random walk, we simply go from one author to the next, following the co-authorship links. The author that is most frequently visited during such a random walk is then seen as most central. This is actually the idea that underlies Google's famous search engine.
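The random walk idea can be sketched in a few lines of plain Python. The sketch below assumes the usual damped variant of the walk, in which the walker follows a link with probability 0.85 and otherwise jumps to a random node, and it assumes that every node has at least one neighbour. The function `pagerank_sketch` and the network `toy` are hypothetical, and this is not meant to reproduce the `igraph` implementation.

```python
def pagerank_sketch(adjacency, damping=0.85, iterations=100):
    """Approximate visiting frequencies of a damped random walk by repeated updating."""
    nodes = list(adjacency)
    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        # With probability 1 - damping the walker jumps to a random node
        new_rank = {node: (1.0 - damping) / len(nodes) for node in nodes}
        # Otherwise it moves to one of the neighbours, chosen uniformly
        for node in nodes:
            share = damping * rank[node] / len(adjacency[node])
            for neighbour in adjacency[node]:
                new_rank[neighbour] += share
        rank = new_rank
    return rank


# Hypothetical toy network: node 0 has three neighbours, node 3 has only one
toy = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1], 3: [0]}
toy_rank = pagerank_sketch(toy)

for node, value in sorted(toy_rank.items(), key=lambda item: item[1], reverse=True):
    print(node, round(value, 3))
```

In this toy network the node with the most neighbours is visited most often, and the node with a single neighbour least often.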

In [None]:
G_collaboration.pagerank(0)

In [None]:
G_collaboration.vs['pagerank'] = G_collaboration.pagerank()
highest_pagerank = sorted(G_collaboration.vs, key=lambda v: v['pagerank'], reverse=True)
[(v['full_name'], v['pagerank']) for v in highest_pagerank[:10]]

We again see that P. Carmeliet is the most central author, according to Pagerank. Most other authors also have a rank that is highly similar to their rank by degree. This is no surprise, since Pagerank resembles the degree, especially on undirected graphs.

We can again take into account the weights. In pagerank this means that a co-author that is a more frequent co-author will be more likely to be visited during a random walk. This is actually much more in line with our intuition than the shortest path. Let us see what we get if we do that.

In [None]:
G_collaboration.vs['pagerank_weighted'] = G_collaboration.pagerank(weights='n_joint_papers')
highest_pagerank = sorted(G_collaboration.vs, key=lambda v: v['pagerank_weighted'], reverse=True)
[(v['full_name'], v['pagerank_weighted']) for v in highest_pagerank[:10]]

We now see some changes in the order of which authors are considered to be most central. What does this say about these authors?

Take a closer look at the discipline of the authors with the highest Pagerank.

In [None]:
[(v['full_name'], v['pagerank_weighted'], v['discipline']) for v in highest_pagerank[:10]]

What do you think of this result? Can you explain why these authors come from this discipline?

## Homophily

One of the analyses that people are frequently interested in is whether people that share some attribute are more likely to be connected. In our case, it offers an opportunity to study the gender divide in science. In our collaboration network, the question is whether men are more likely to work with men and women are more likely to work with women.

In social network analysis, this question can be answered by looking at the *assortativity*. This is a measure that indicates whether people of the same type are more likely to connect (positive assortativity) or less likely to connect (negative assortativity).
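To get a feeling for the measure, here is a minimal plain-Python sketch of the standard assortativity coefficient for a nominal attribute on an undirected toy network. It compares the observed share of links within the same category to the share expected if links were placed regardless of category. The function `nominal_assortativity` and the toy data are hypothetical, and the sketch assumes at least two categories are present.

```python
from collections import Counter
from fractions import Fraction


def nominal_assortativity(edges, category):
    """Assortativity of an undirected network for a nominal node attribute."""
    # Count every undirected link in both directions
    ends = Counter()
    for u, v in edges:
        ends[(category[u], category[v])] += 1
        ends[(category[v], category[u])] += 1
    total = sum(ends.values())
    categories = set(category.values())
    # Observed share of links within the same category
    same = sum(Fraction(ends[(c, c)], total) for c in categories)
    # Share of link ends attached to each category
    share = {c: sum(Fraction(ends[(c, d)], total) for d in categories)
             for c in categories}
    expected = sum(share[c] ** 2 for c in categories)
    return (same - expected) / (1 - expected)


# Hypothetical toy network: a triangle of men, a triangle of women, one link in between
toy_gender = {0: 'm', 1: 'm', 2: 'm', 3: 'f', 4: 'f', 5: 'f'}
toy_edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
toy_r = nominal_assortativity(toy_edges, toy_gender)

print(toy_r, float(toy_r))
```

The result is 5/7 (about 0.71), a clearly positive assortativity. A network in which every link connects a man and a woman would instead give -1.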

Let us see what we obtain for our collaboration network here.

In [None]:
G_collaboration.assortativity_nominal(ig.VertexClustering.FromAttribute(G_collaboration, 'gender').membership)

Based on this measure, it seems that men and women collaborate quite evenly.

Let us see whether people from the same discipline are more likely to collaborate.

In [None]:
G_collaboration.assortativity_nominal(ig.VertexClustering.FromAttribute(G_collaboration, 'discipline').membership)

Clearly, collaboration within the same discipline is much more likely, as expected. Let us see whether the stronger collaborations are even more likely to occur within the same discipline.

In [None]:
H = G_collaboration.subgraph_edges(G_collaboration.es.select(n_joint_papers_gt = 10))

Here, we use the `select` function of the `EdgeSequence` to select edges. The argument `n_joint_papers_gt = 10` refers to the condition that `n_joint_papers` is **g**reater **t**han `10`. We then use the function `subgraph_edges` to create a new graph with only those selected edges.

Now let us look at the disciplinary assortativity of the stronger collaboration network.

In [None]:
H.assortativity_nominal(ig.VertexClustering.FromAttribute(H, 'discipline').membership)

Does gender homophily perhaps only show up for stronger collaborations?

In [None]:
H.assortativity_nominal(ig.VertexClustering.FromAttribute(H, 'gender').membership)

Perhaps the absence of any gender divide in collaboration is a consequence of disciplinary differences.

First, let us create different subgraphs for each discipline.

In [None]:
disciplines = ig.VertexClustering.FromAttribute(G_collaboration, 'discipline')

Now, we can calculate the gender assortativity for each discipline separately.

In [None]:
gender_assortativity_df = pd.DataFrame([
    {'discipline': H.vs['discipline'][0], 
     'n_nodes': H.vcount(),
     'n_edges': H.ecount(),
     'gender_assortativity': H.assortativity_nominal(ig.VertexClustering.FromAttribute(H, 'gender').membership)
    }
 for H in disciplines.subgraphs()])
gender_assortativity_df

What would you conclude on the basis of these statistics?

Besides assortativity based on gender or discipline, we can also look at assortativity based on degree. Do authors that collaborate a lot tend to collaborate with other authors that collaborate a lot? Or is it the other way around, and do authors that collaborate little tend to collaborate with authors that collaborate a lot? What would you expect?

This can be answered by the degree assortativity.

In [None]:
G_collaboration.assortativity_degree()

### Citation network

Let us now also load the citation data.

In [15]:
citation_df = pd.read_csv('citation.txt', sep='\t')

In [None]:
G_citation = ig.Graph.DictList(
      vertices=authors_df.to_dict('records'),
      edges=citation_df.to_dict('records'),
      vertex_name_attr='author_id',
      edge_foreign_keys=('author_id_from', 'author_id_to'),
      directed=True
      )

In [None]:
G_citation.reciprocity()

Let us see if (wo)men are more likely to cite other (wo)men.

In [None]:
G_citation.assortativity(ig.VertexClustering.FromAttribute(G_citation, 'gender').membership)

This seems to be a very weak signal of any homophily.

Stronger citation links perhaps show a higher degree of gender homophily.

In [None]:
H = G_citation.subgraph_edges(G_citation.es.select(n_cits_gt = 15))
H.assortativity(ig.VertexClustering.FromAttribute(H, 'gender').membership)

Indeed, there is some indication that stronger citation links show more signs of gender homophily.

## Weak links

The weak link theory we discussed in class had two elements:

1. Weak links tend to connect people from different groups.
2. New information enters groups through weak links.

Let us start by analysing whether authors collaborate more frequently with authors from the same group. An easy start is simply by looking at whether the average collaboration is stronger within disciplines than between disciplines.

In [None]:
disciplines = ig.VertexClustering.FromAttribute(G_collaboration, 'discipline')

We now want to calculate the mean number of joint papers across disciplinary boundaries, and within disciplinary boundaries. To calculate the mean, the `numpy` package comes in handy. Let us import that first, and call it `np`.

In [None]:
import numpy as np

Now we can calculate the mean `n_joint_papers` that are crossing the disciplinary boundaries.

In [None]:
edge_crossing = disciplines.crossing()
np.mean([e['n_joint_papers'] for e in G_collaboration.es if edge_crossing[e.index]])

Let us compare that to the mean that does not cross disciplinary boundaries.

In [None]:
np.mean([e['n_joint_papers'] for e in G_collaboration.es if not edge_crossing[e.index]])

Clearly, collaborations that cross disciplinary boundaries tend to be weaker: the average number of joint papers within disciplinary boundaries is more than twice as high as across disciplinary boundaries.

Will this also hold at a much finer scale?

Let us look at whether a link that connects authors who share many co-authors tends to be stronger. In other words, do co-authors who share many other co-authors tend to collaborate more frequently?

In [None]:
G_collaboration.es['similarity_jaccard'] = G_collaboration.similarity_jaccard(pairs=[e.tuple for e in G_collaboration.es])
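The Jaccard similarity of two authors is the number of co-authors they share divided by the number of co-authors that at least one of them has. A minimal plain-Python sketch on a toy network is given below; the function `jaccard` and the network `toy` are hypothetical, and a node is not counted as its own neighbour.

```python
def jaccard(adjacency, u, v):
    """Shared neighbours of u and v divided by all neighbours of u or v."""
    union = adjacency[u] | adjacency[v]
    if not union:
        return 0.0  # neither node has any neighbours
    return len(adjacency[u] & adjacency[v]) / len(union)


# Hypothetical toy network: 0 and 1 share the co-authors 2 and 3; 4 and 5 only have each other
toy = {0: {1, 2, 3}, 1: {0, 2, 3}, 2: {0, 1}, 3: {0, 1}, 4: {5}, 5: {4}}

print(jaccard(toy, 0, 1))  # 2 shared co-authors out of 4 in total
print(jaccard(toy, 4, 5))  # no shared co-authors
```

Two linked authors without any shared co-authors have a similarity of 0. Keep in mind that the logarithm of 0 is not finite, which matters whenever the similarity is analysed on a logarithmic scale.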

Using `numpy` we can easily calculate the correlation coefficient between the Jaccard similarity and the number of joint papers.

In [None]:
np.corrcoef(np.log(G_collaboration.es['similarity_jaccard']), np.log(G_collaboration.es['n_joint_papers']))

Indeed, also at such a detailed scale, it seems that people that share many collaborators tend to collaborate more frequently.

Now let us plot the results.

In [None]:
plt.loglog(G_collaboration.es['similarity_jaccard'], G_collaboration.es['n_joint_papers'], '.', alpha=0.05);

What do you make of this?

As is often the case, the second part of the theory on weak links is much more difficult to study. It requires us to collect data on information that is spreading across co-authors. We will therefore not go into this here.

## Structural holes

As discussed during class, people that fill structural holes occupy a particular position in the network. They are able to broker information between people that would otherwise not be connected. We can quantify this based on the clustering coefficient.

Typically, nodes with a higher degree will show a lower clustering coefficient. This is partly because a larger number of neighbors is simply less likely to all collaborate with each other. If there are only 4 or 5 co-authors, it is not unlikely that they all collaborate with each other. However, if you get to 100 co-authors, it is rather unlikely that all of them collaborate.

In [None]:
plt.plot(G_collaboration.degree(), G_collaboration.transitivity_local_undirected(mode='zero'), '.', alpha=0.5)
plt.xscale('log')

Now let us see if authors that tend to connect different people are more frequently cited on average.

First let us calculate the total and average number of citations.

In [None]:
G_citation.vs['total_cits'] = G_citation.strength(weights='n_cits')
G_citation.vs['avg_cits'] = np.array(G_citation.vs['total_cits'])/np.array(G_citation.vs['n_pubs'])

Let us calculate the correlation between the average number of citations and the clustering coefficient.

In [None]:
np.corrcoef(G_collaboration.transitivity_local_undirected(mode='zero'), G_citation.vs['avg_cits'])

Let us also make a scatter plot of the results.

In [None]:
plt.plot(G_collaboration.transitivity_local_undirected(mode='zero'), G_citation.vs['avg_cits'], '.', alpha=0.1)
plt.yscale('log')

What do you make of this relationship? Are authors that have a higher clustering coefficient more frequently cited? What caveats does this analysis have?

# Citations and co-authorship

Let us delve a bit further into the relationship between citations and co-authorship.

First, let us see how the citation degree compares to the collaboration degree. For the citation degree, there are two variants: the in-degree (how many authors cite an author) and the out-degree (how many authors are cited by an author).
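The two variants can be illustrated with a minimal plain-Python sketch. The list `toy_citations` contains hypothetical citation links, each written as a pair of citing author and cited author.

```python
from collections import Counter

# Hypothetical citation links as (citing author, cited author) pairs
toy_citations = [('a', 'b'), ('c', 'b'), ('b', 'a'), ('a', 'c')]

# In-degree: by how many authors is an author cited?
toy_in_degree = Counter(cited for _, cited in toy_citations)
# Out-degree: how many authors does an author cite?
toy_out_degree = Counter(citing for citing, _ in toy_citations)

print(dict(toy_in_degree))
print(dict(toy_out_degree))
```

Author `b` is cited by two authors (in-degree 2) but cites only one author (out-degree 1), while author `a` cites two authors and is cited by one.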

In [None]:
f = plt.figure(figsize=(18,6))
ax = f.add_subplot(1, 2, 1)
ax.plot(G_collaboration.degree(), G_citation.degree(mode='in'), '.')
ax2 = f.add_subplot(1, 2, 2)
ax2.plot(G_collaboration.degree(), G_citation.degree(mode='out'), '.')

Especially the citation out-degree correlates quite well with the collaboration degree. The number of authors that are cited by an author is roughly proportional to the number of collaborators of an author. That raises the question: do people tend to cite their co-authors? Going one step further: do people tend to cite co-authors of their co-authors?

Let us try to analyse that a bit further. We first detect communities in the collaboration network and restrict both networks to the largest community, which keeps the computation manageable. We then ask `igraph` to calculate the shortest path distance between all pairs of nodes in that community.

In [None]:
comms = G_collaboration.community_multilevel()

In [None]:
max_comm, max_size = max(enumerate(comms.sizes()), key=lambda x: x[1])

In [None]:
H_collaboration = G_collaboration.induced_subgraph(comms[max_comm])
H_citation = G_citation.induced_subgraph(comms[max_comm])

<div class="alert alert-warning">
The following may take some time to run
</div>

In [None]:
dist = np.array(H_collaboration.shortest_paths())

Now, let us look at the number of citations for all pairs of nodes.

In [None]:
n_cits = np.array(H_citation.get_adjacency(attribute='n_cits').data)

We can create a dataframe for this information.

In [None]:
n_cits_per_dist_df = pd.DataFrame({'n_cits': np.reshape(n_cits, -1),
                                   'dist': np.reshape(dist, -1)})

In [None]:
avg_cit_per_dist = n_cits_per_dist_df.groupby('dist').mean()
dist_hist = n_cits_per_dist_df.groupby('dist').size()

f = plt.figure(figsize=(9,6))
ax = f.add_subplot(1, 1, 1)
l1, = ax.plot(avg_cit_per_dist.index, avg_cit_per_dist, color='b', label='Avg. cits.')
ax.set_yscale('log')
ax.spines['left'].set_color('b')
ax.set_ylabel('Avg. cits.')

ax2 = ax.twinx()
l2, = ax2.plot(dist_hist.index, dist_hist/sum(dist_hist), color='r', label='Dist. freq.')
ax2.set_yscale('log')
ax2.set_ylabel('Dist. freq')
lines = [l1, l2]
plt.legend(lines, [l.get_label() for l in lines], loc='best')