# General setup

In [0]:
# first set whether this notebook is being run by you or by Elsevier supporters
permissions='default' # default or fulldata (default if by you)
# set the project name
project_name = '2806_network_basics'

In [0]:
%run /Snippets/header_latest

# Create network

## Load datasets

In [0]:
df_ani = spark.read.format("parquet").load(basePath+tablename_ani) # Table of publications
df_apr = spark.read.format("parquet").load(basePath+tablename_apr) # Table of authors
df_asjc = table('static_data.asjc')                                # Table of all ASJC

## Prepare co-authorship data

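Let us create a co-author dataframe. To do this, we expand (`func.explode`) the list of authors twice, which yields every ordered pair of authors on each publication, including each author paired with themselves. The final filter then keeps each unordered pair only once and drops the self-pairs.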

In [0]:
df_co_au = (
  df_ani
  .filter(func.array_contains('ASJC', '3309'))        # Limit to library and information science
  .filter('Year BETWEEN 2000 AND 2020')               # Limit to year 2000--2020
  .select('Au', func.explode('Au').alias('Au_A'))     # Get list of all authors
  .select('Au_A', func.explode('Au').alias('Au_B'))   # Get pair of all authors
  .select(func.col('Au_A.auid').alias('auid_A'),      # Call one author of pair auid_A
          func.col('Au_B.auid').alias('auid_B'))      # Call other author of pair auid_B
  .filter('auid_A < auid_B')                          # Only retain one of the two pairs, e.g. only (1, 3) not (3, 1)
)
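
The effect of the two `explode` calls followed by the `auid_A < auid_B` filter can be illustrated in plain Python. This is only a sketch on toy data: the `papers` dictionary and the integer author ids below are made up for illustration and do not come from the dataset.

```python
# Toy data (hypothetical): each paper maps to its list of author ids.
papers = {
    'paper_1': [3, 1, 2],
    'paper_2': [1, 3],
}

pairs = []
for authors in papers.values():
    for a in authors:          # first explode: one row per author
        for b in authors:      # second explode: one row per (author, author) combination
            if a < b:          # keep each unordered pair once and drop self-pairs
                pairs.append((a, b))

print(pairs)
```

Note that the pair `(1, 3)` appears once for each paper the two authors share, so the resulting edge list can contain duplicate edges.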

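Let us get the unique authors in this dataset. Since we derive them from `df_co_au`, authors without any co-author in the selected publications are not included.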

In [0]:
df_uniq_au = (
  df_co_au
  .select(func.col('auid_A').alias('auid'))
  .union(df_co_au.select(func.col('auid_B').alias('auid')))
  .distinct()
)

Now let us add some information about the authors to the dataframe.

In [0]:
df_uniq_au = (
  df_uniq_au
  .join( (df_apr
          .select('auid', func.struct('given_name_pn', 'initials_pn', 'surname_pn', 'indexed_name_pn').alias('name'),
                      func.col('affiliation_current_full')[0].alias('affiliation'),
                      func.arrays_zip('ASJC_frequency_I', 'ASJC').alias('ASJC'))
          .withColumn('max_ASJC', func.sort_array('ASJC', asc=False)[0])
          .select('auid', 'name', 'affiliation', func.col('max_ASJC.ASJC').alias('ASJC'))
         )
        ,on='auid'
        ,how='left')
  .orderBy('auid')
)
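
The `arrays_zip` and `sort_array` step above selects, for each author, the ASJC code with the highest frequency. A plain-Python sketch of the same idea is shown below. The frequencies and codes are hypothetical, and the sketch assumes that the zipped structs are compared field by field (frequency first), which is how Python compares tuples.

```python
# Hypothetical per-author arrays: frequencies and the ASJC codes they belong to.
asjc_frequency = [2, 7, 4]
asjc_codes = ['1710', '3309', '1802']

# Zip into (frequency, code) pairs, similar to the array of structs from arrays_zip.
zipped = list(zip(asjc_frequency, asjc_codes))

# Sort in descending order; pairs are compared on frequency first, then on code.
max_pair = sorted(zipped, reverse=True)[0]
main_asjc = max_pair[1]
print(main_asjc)
```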

Now `df_co_au` contains all the links between all co-authors, while `df_uniq_au` contains information about each author. Let us see how large this dataset is.

In [0]:
print(df_uniq_au.count())
print(df_co_au.count())

## Construct the network

For network analysis we will use the library `igraph`. It is pre-installed for us, so we can just load it.

In [0]:
import igraph as ig

Now we can construct our network.
We need to convert our `pyspark` dataframe to a `pandas` dataframe for `igraph` to be able to construct the network.

In [0]:
G_coauthorship = ig.Graph.DataFrame(edges=df_co_au.toPandas(),
                       vertices=(df_uniq_au.select('auid', func.col('name.*'), 
                                                   func.col('affiliation.afdispname'), 
                                                   func.col('affiliation.city'), 
                                                   func.col('affiliation.country'), 
                                                   'ASJC')
                                .toPandas()
                                ),
                       directed=False
                      )

Note that, unlike `pyspark`, neither `pandas` nor `igraph` is distributed.
This means that you should be careful when using large datasets (many millions of edges and/or nodes).
It is still possible to work with such large datasets in `igraph`, but the whole network then has to fit in the memory of a single machine.

# Basic network operations

Now that we have created a scientometric network, let us look at some basic analyses of it. We start with some basic information about the network.

In [0]:
print(G_coauthorship.summary())

The first line indicates that we have an undirected graph (`U`) with 3,678 nodes and 12,083 links. The next line shows the vertex attributes (indicated by a `v` after the name of the attribute) and the edge attributes (indicated by an `e`), of which there are currently none.

## Connectivity

**❓  Let us start with a very simple question. Is the co-authorship network connected?**

In [0]:
G_coauthorship.is_connected()

Apparently, not all authors in this dataset are connected via co-authored papers.

**❓ How many authors do you think will be connected to each other? 500? 5000? Almost everybody?**

In order to take a closer look, we need to detect the *connected components*.

**❓ Get all components (use the function `components`).**

In [0]:
components = G_coauthorship.components()

We might be interested in the so-called giant component.

**❓ Can you get the giant component from the `components` variable?**

Hint: you can use `Tab` and `Shift-Tab` to find out more about possible functions.

In [0]:
H = components.giant()

**❓ How many nodes and edges are there in the giant component?**

In [0]:
print(H.summary())

**❓ What is the percentage of nodes that are in the giant component?**
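
As a hint, the percentage is simply the number of nodes in the giant component divided by the number of nodes in the whole network; in `igraph` the node count of a graph is given by `vcount()`. The plain-Python sketch below shows the computation on a hypothetical toy network, with the components found by a breadth-first search. All names and numbers in it are made up for illustration.

```python
from collections import deque

# Hypothetical toy network: 7 nodes, given as an undirected edge list.
n_nodes = 7
toy_edges = [(0, 1), (1, 2), (2, 3), (4, 5)]   # node 6 is isolated

# Build adjacency lists.
neighbours = {v: [] for v in range(n_nodes)}
for a, b in toy_edges:
    neighbours[a].append(b)
    neighbours[b].append(a)

# Find the connected components with a breadth-first search.
seen = set()
toy_components = []
for start in range(n_nodes):
    if start in seen:
        continue
    seen.add(start)
    queue = deque([start])
    component = []
    while queue:
        v = queue.popleft()
        component.append(v)
        for w in neighbours[v]:
            if w not in seen:
                seen.add(w)
                queue.append(w)
    toy_components.append(component)

# The giant component is the largest one.
giant = max(toy_components, key=len)
pct_in_giant = 100 * len(giant) / n_nodes
print(f'{pct_in_giant:.1f}% of the nodes are in the giant component')
```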

**❓ Double check whether the giant component is connected.**

In [0]:
H.is_connected()

## Shortest paths

Let us take a closer look at how far apart authors in this dataset are from one another.
Let us simply take a look at node number `70` and node number `167`.
These are node *indices*, not node *labels*.
That is, they are not `auid`s.

In [0]:
paths = G_coauthorship.get_shortest_paths(70, to=167)
paths

This returns a list containing one shortest path, given as a list of node indices, from node number 70 to each of the nodes specified by `to`.
In this case, we specified only one other node, namely node number 167, so let us select that path.

In [0]:
path = paths[0]
path

**❓ How many nodes are in the path? What is the path length?**

In [0]:
len(path)
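
Keep in mind that `len(path)` counts the nodes on the path, while the path length is the number of edges, which is one less. The sketch below illustrates this with a breadth-first search on a hypothetical toy network in plain Python. It is only meant to show the idea, not how `igraph` computes shortest paths internally.

```python
from collections import deque

# Hypothetical toy network as adjacency lists (undirected).
neighbours = {0: [1], 1: [0, 2, 3], 2: [1, 4], 3: [1, 4], 4: [2, 3]}

def bfs_shortest_path(source, target):
    """Return one shortest path from source to target as a list of nodes."""
    parent = {source: None}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        if v == target:
            break
        for w in neighbours[v]:
            if w not in parent:
                parent[w] = v
                queue.append(w)
    if target not in parent:
        return []                  # target is unreachable
    # Walk back from the target to the source.
    found = []
    node = target
    while node is not None:
        found.append(node)
        node = parent[node]
    return found[::-1]

toy_path = bfs_shortest_path(0, 4)
print(toy_path, 'nodes:', len(toy_path), 'length:', len(toy_path) - 1)
```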

The numbers in `path` probably do not mean that much to you. 
You can find out more about an individual node by looking at the `VertexSequence` of `igraph`, abbreviated as `vs`.
This is a sort of list of all vertices, and is indexed by brackets `[ ]`, similar to lists, instead of parentheses `( )` as we do for functions.

In [0]:
G_coauthorship.vs[70]

The vertex itself also behaves like a *dictionary*, so you can show, for example, the author name as follows:

In [0]:
G_coauthorship.vs[70]['indexed_name_pn']

You can also list multiple vertices at once.

In [0]:
G_coauthorship.vs[[70, 186, 167]]['indexed_name_pn']

You can of course also simply pass the variable `path` that we constructed earlier.

In [0]:
G_coauthorship.vs[path]['indexed_name_pn']

This shows that O'Hearn collaborated with Lainhart, who collaborated with Buckner, who collaborated with Roffman.
You can also get the vertex by searching for the author name. For example, if we want to find `'Lainhart J.'` we can use the following.

In [0]:
G_coauthorship.vs.find(indexed_name_pn_eq = 'Lainhart J.')

Here `indexed_name_pn_eq` refers to the condition that the vertex attribute `indexed_name_pn` should **eq**ual `'Lainhart J.'`.

**❓ Find the shortest path from `'Rosen B.'` to `'Huckins J.'`. Who is in between?**

In [0]:
u = G_coauthorship.vs.find(indexed_name_pn_eq = 'Rosen B.')
v = G_coauthorship.vs.find(indexed_name_pn_eq = 'Huckins J.')
paths = G_coauthorship.get_shortest_paths(u, v)
path = paths[0]
G_coauthorship.vs[path]['indexed_name_pn']

You can also explicitly provide vertex *names*. These are used internally by `igraph` to look up vertices in certain circumstances. However, you need to be careful when doing this, because the `name` you use must be unique. Let us check whether any authors share the same indexed name.

In [0]:
display(
  df_uniq_au
  .groupBy('name.indexed_name_pn')
  .count()
  .filter('count > 1')
)

| indexed_name_pn | count |
|---|---|
| Liu L. | 3 |
| Zhang X. | 5 |
| Chen L. | 2 |
| Zhu H. | 2 |
| Mao J. | 2 |
| Yao X. | 2 |
| Lin C. | 4 |
| Gao F. | 2 |
| Kim H. | 2 |
| Li Z. | 2 |


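Given these results, we had better not use the indexed author name as the `name` attribute for indexing. Keep in mind that `find` returns only the first vertex that matches, so a lookup by `indexed_name_pn` may silently pick one of several authors who share the same name.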

Shortest paths do not need to be unique: there can be multiple shortest paths between the same two nodes. We can also get all shortest paths from `igraph`. Let's consider all shortest paths between `'Rosen B.'` and `'Huckins J.'`.

In [0]:
u = G_coauthorship.vs.find(indexed_name_pn_eq = 'Rosen B.')
v = G_coauthorship.vs.find(indexed_name_pn_eq = 'Huckins J.')
G_coauthorship.get_all_shortest_paths(u, v)

**❓ Can you find all shortest paths between `'Williams R.'` and `'Pollock N.'`? What do you think of the results?**

In [0]:
u = G_coauthorship.vs.find(indexed_name_pn_eq = 'Williams R.')
v = G_coauthorship.vs.find(indexed_name_pn_eq = 'Pollock N.')
G_coauthorship.get_all_shortest_paths(u, v)

This is just the same result repeated multiple times!
Apparently, Williams and Pollock are directly connected, but with multiple edges.
That is, they have co-authored multiple publications.
Indeed, the network may contain many duplicate edges, one for each paper that a pair of authors co-authored.
We can check whether a network is "simple", that is, whether it contains no duplicate edges or self-loops.

In [0]:
G_coauthorship.is_simple()

## Weighted paths

If we want, we can simplify the graph. This removes the duplicate edges and replaces each group of them with a single edge. We can also introduce an edge attribute that counts the number of joint papers behind each edge.

In [0]:
G_coaut_simple = G_coauthorship.copy()
G_coaut_simple.es['n_joint_papers'] = 1
G_coaut_simple.simplify(combine_edges='sum')
print(G_coaut_simple.summary())

As you can see, this reduced the number of edges from 12,083 to 11,758, meaning there were a few hundred duplicate edges.
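
What `simplify(combine_edges='sum')` does to the `n_joint_papers` attribute can be mimicked in plain Python with a `Counter`: every edge starts with a count of 1, and the counts of identical pairs are summed. The edge list below is hypothetical and only serves as an illustration.

```python
from collections import Counter

# Hypothetical edge list with one entry per co-authored paper.
multi_edges = [(1, 3), (1, 2), (2, 3), (1, 3), (1, 3)]

# Give every edge a count of 1 and sum the counts of identical pairs.
toy_joint_papers = Counter(multi_edges)

print(len(multi_edges), 'edges before,', len(toy_joint_papers), 'edges after')
print(toy_joint_papers[(1, 3)])
```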

Let us take a closer look at the path between `'Opthof T.'` and `'Bornmann L.'`.
Instead of the nodes on the path, we now want to look at the edges on the path.

In [0]:
u = G_coaut_simple.vs.find(indexed_name_pn_eq='Opthof T.')
v = G_coaut_simple.vs.find(indexed_name_pn_eq='Bornmann L.')
epath = G_coaut_simple.get_shortest_paths(u, to=v, output='epath')
epath

There are two edges on this path, but the numbers themselves are not very informative. They refer to the edges, and similar to the `VertexSequence` we encountered earlier, there is also an `EdgeSequence`, abbreviated as `es`.

In [0]:
[G_coaut_simple.vs[e.tuple]['indexed_name_pn'] for e in G_coaut_simple.es[epath[0]]]

As you can see, each edge has a name in common with the next edge. However, the order of the names in an edge is arbitrary, since edges are undirected in this network.

Let us take a closer look at the number of joint papers that the authors on this path have co-authored.

In [0]:
G_coaut_simple.es[epath[0]]['n_joint_papers']

**❓ Perhaps there are other paths that connect the two authors with more joint papers? Perhaps we could use the number of joint papers as weights?**

In [0]:
u = G_coaut_simple.vs.find(indexed_name_pn_eq='Opthof T.')
v = G_coaut_simple.vs.find(indexed_name_pn_eq='Bornmann L.')
epath = G_coaut_simple.get_shortest_paths(u, v, weights='n_joint_papers', output='epath')
epath

We do get a different path, which is actually longer. Let us take a look at the number of joint papers.

In [0]:
G_coaut_simple.es[epath[0]]['n_joint_papers']

The total number of joint papers is lower! That is because *shortest path* means: the path with the lowest sum of the weights. This is clearly not what we want. You should always be aware of this whenever using the concept of the *shortest path*.

**Attention! Weighted shortest paths have the *lowest* total weight.**
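
One possible workaround, which is an assumption on our side and not something prescribed here, is to turn the number of joint papers into a distance, for example `1 / n_joint_papers`, so that strong ties become short. The plain-Python sketch below uses a small Dijkstra implementation on a hypothetical three-author network to show the difference. The function `dijkstra_path` and all data are made up for illustration.

```python
import heapq

def dijkstra_path(edges, source, target):
    """Return (total_cost, path) of the cheapest path in an undirected graph.

    edges maps a pair of nodes (a, b) to a positive cost.
    """
    neighbours = {}
    for (a, b), cost in edges.items():
        neighbours.setdefault(a, []).append((b, cost))
        neighbours.setdefault(b, []).append((a, cost))

    best = {source: 0.0}
    parent = {source: None}
    heap = [(0.0, source)]
    while heap:
        dist, v = heapq.heappop(heap)
        if dist > best[v]:
            continue               # outdated heap entry
        if v == target:
            break
        for w, cost in neighbours.get(v, []):
            new_dist = dist + cost
            if w not in best or new_dist < best[w]:
                best[w] = new_dist
                parent[w] = v
                heapq.heappush(heap, (new_dist, w))

    if target not in best:
        return float('inf'), []    # target is unreachable
    found, node = [], target
    while node is not None:
        found.append(node)
        node = parent[node]
    return best[target], found[::-1]

# Hypothetical numbers of joint papers per pair of authors.
joint_papers = {('A', 'B'): 1, ('A', 'C'): 5, ('C', 'B'): 5}

# Using the counts directly as weights favours the weakest tie.
cost_raw, path_raw = dijkstra_path(joint_papers, 'A', 'B')

# Using 1 / count as a distance favours strong ties instead.
distances = {pair: 1 / n for pair, n in joint_papers.items()}
cost_inv, path_inv = dijkstra_path(distances, 'A', 'B')

print(path_raw, cost_raw)
print(path_inv, cost_inv)
```

With the raw counts the direct link with a single joint paper wins, whereas with inverted weights the path through `'C'`, which consists of two strong ties, is preferred.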

We can also use fractional counting to define the co-authorship network.
However, we cannot do this directly in `igraph`, since we need to know for each paper how many authors there were.
Therefore, let us go back to `pyspark`.
It is usually a good idea to define the network before moving to `igraph`, and to use `igraph` only for the network-related functionality.

In [0]:
df_co_au_frac = (
  df_ani
  .filter(func.array_contains('ASJC', '3309'))        # Limit to library and information science
  .filter('Year BETWEEN 2000 AND 2020')               # Limit to year 2000--2020
  .filter(func.size('Au') > 1)                        # Limit to publications with multiple authors
  .withColumn('weight', 1/(func.size('Au') - 1))      # Calculate fractional weight
  .select('Au', 'weight', func.explode('Au').alias('Au_A'))     # Get list of all authors
  .select('Au_A', 'weight', func.explode('Au').alias('Au_B'))   # Get pair of all authors
  .select(func.col('Au_A.auid').alias('auid_A'),      # Call one author of pair auid_A
          func.col('Au_B.auid').alias('auid_B'),      # Call other author of pair auid_B
          'weight')                                   # And get the weight
  .filter('auid_A < auid_B')                          # Only retain one of the two pairs, e.g. only (1, 3) not (3, 1)
  .groupBy('auid_A', 'auid_B')                        # Group by the pair of authors
  .agg(func.sum('weight').alias('weight'))            # And sum all the weights
)
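
The fractional weighting can be checked on a toy example in plain Python. In this sketch every pair of authors on a paper with `n` authors receives a weight of `1 / (n - 1)`, and the weights are summed over papers. The `papers` dictionary is hypothetical, and the sketch assumes that each author appears only once per paper.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical papers with their author ids.
papers = {
    'paper_1': [1, 2, 3],   # three authors: each pair gets 1 / (3 - 1) = 0.5
    'paper_2': [1, 3],      # two authors: the pair gets 1 / (2 - 1) = 1.0
    'paper_3': [4],         # single author: no co-authorship links
}

pair_weight = defaultdict(float)
for authors in papers.values():
    if len(authors) < 2:
        continue                                     # skip single-author papers
    weight = 1 / (len(authors) - 1)
    for a, b in combinations(sorted(authors), 2):    # each unordered pair once, a < b
        pair_weight[(a, b)] += weight

print(dict(pair_weight))
```

With this weighting, the links of one author that stem from a single paper always sum to 1: on `paper_1`, author 1 has two links of weight 0.5 each.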

Let us create the network, similar to before.

In [0]:
G_coaut_frac = ig.Graph.DataFrame(edges=df_co_au_frac.toPandas(),
                       vertices=(df_uniq_au.select('auid', func.col('name.*'), 
                                                   func.col('affiliation.afdispname'), 
                                                   func.col('affiliation.city'), 
                                                   func.col('affiliation.country'), 
                                                   'ASJC')
                                .toPandas()
                                ),
                       directed=False
                      )

The total number of links should be identical to that of the simplified co-authorship network (`G_coaut_simple`) that we defined earlier.

In [0]:
print(G_coaut_frac.summary())

The weights, however, are different now, and are stored in the edge attribute `weight`.

**❓ Can you get the weighted shortest path from `'Opthof T.'` to `'Bornmann L.'` using the `weight` as weights? What are the fractional weights of this path?**

In [0]:
u = G_coaut_frac.vs.find(indexed_name_pn_eq='Opthof T.')
v = G_coaut_frac.vs.find(indexed_name_pn_eq='Bornmann L.')
epath = G_coaut_frac.get_shortest_paths(u, to=v, output='epath', weights='weight')
G_coaut_frac.es[epath[0]]['weight']

As you know, the total number of links of a node is called the *degree* of a node.
The sum of the weights of all these links is called the *strength* of a node.
Let us calculate the strength of the node `'Bornmann L.'` in both the fractional and the simple network.

In [0]:
s_frac = G_coaut_frac.strength(G_coaut_frac.vs.find(indexed_name_pn_eq='Bornmann L.'),
                               weights='weight')
s_simple = G_coaut_simple.strength(G_coaut_simple.vs.find(indexed_name_pn_eq='Bornmann L.'),
                                   weights='n_joint_papers')
print(s_frac, s_simple)

**❓ What do these numbers represent?**
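
As a hint for the question above, the difference between degree and strength can be seen on a toy weighted edge list in plain Python. The edge list and the helper `degree_and_strength` are hypothetical and only illustrate the two definitions.

```python
# Hypothetical weighted edge list: (author_a, author_b, weight).
weighted_edges = [(1, 2, 0.5), (1, 3, 1.5), (2, 3, 0.5)]

def degree_and_strength(node):
    """Return the number of links of a node and the sum of their weights."""
    weights = [w for a, b, w in weighted_edges if node in (a, b)]
    return len(weights), sum(weights)

print(degree_and_strength(1))
print(degree_and_strength(2))
```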