# Practice Session 07: Hubs and authorities


In this session we will compute Hubs and Authorities using [NetworkX](https://networkx.github.io/), a Python package. This analysis is inspired by a paper on international trade ([Deguchi et al. 2014](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0100338)).

The data comes from the OECD's [trade in value by partner country](https://stats.oecd.org/Index.aspx?DataSetCode=PARTNER) dataset. These are your input files:

* ``trade_1980-flows.csv`` international trade in 1980
* ``trade_2013-flows.csv`` international trade in 2013
* ``trade_2013-countries.csv`` list of countries and territories

Please note that the [HITS](https://en.wikipedia.org/wiki/HITS_algorithm) (hubs and authorities) algorithm is implemented in NetworkX and elsewhere, but those implementations follow a different design. Do this assignment on your own, following the design in this Notebook.

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

# 0. Code snippets you may need

## 0.1. Reading a compressed CSV file

```python
import csv
import gzip

# Open a compressed file for reading in text mode
with gzip.open(FILENAME, "rt") as input_file:

    # Create a CSV reader for a comma-delimited file with a header
    reader = csv.DictReader(input_file, delimiter=',')
    
    # Iterate through records, each record is a dictionary
    for record in reader:
        print(record)
```
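To try the reading pattern without the real input files, here is a minimal, self-contained sketch that writes a tiny compressed CSV into memory and reads it back. The column names `code` and `name` and the two rows are made up for illustration; check the header of the actual file before relying on them.

```python
import csv
import gzip
import io

# Build a tiny gzip-compressed CSV in memory.
# The column names and rows are hypothetical, for illustration only.
raw = "code,name\nITA,Italy\nUSA,United States\n"
buffer = io.BytesIO()
with gzip.open(buffer, "wt") as out_file:
    out_file.write(raw)
buffer.seek(0)

# Read it back with the same pattern used for a file on disk
with gzip.open(buffer, "rt") as input_file:
    reader = csv.DictReader(input_file, delimiter=',')
    records = [dict(record) for record in reader]

print(records)
```

Each record is a dictionary keyed by the column names found in the header line.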

## 0.2. Creating a directed graph in NetworkX

Create an empty graph using:

```python
g = nx.DiGraph()
```

Then read the file and add weighted edges to `g`. To add a weighted edge from node *u* to node *v* with weight *w*, use `g.add_edge(u, v, weight=w)`.
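As a minimal sketch of this pattern, the following builds a small graph from made-up flows and reads one weight back. The country codes and amounts are illustrative only and do not come from the dataset.

```python
import networkx as nx

# Toy flows as (source, destination, weight); the values are made up
toy_flows = [("ITA", "USA", 3), ("USA", "ITA", 2), ("ITA", "FRA", 5)]

g_toy = nx.DiGraph()
for u, v, w in toy_flows:
    g_toy.add_edge(u, v, weight=w)

# Each edge keeps its attributes in a dictionary
weight_ita_usa = g_toy["ITA"]["USA"]["weight"]
print(g_toy.number_of_nodes(), g_toy.number_of_edges(), weight_ita_usa)
```

Note that in a directed graph the edges ITA to USA and USA to ITA are distinct and carry separate weights.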


## 0.3. Iterate through a graph in NetworkX

To iterate through the nodes of a graph:

```python
for n in g.nodes():
    # n is the name of the node
    print(n)
```

To iterate through the edges of a graph:

```python
for u, v, d in g.edges(data=True):
    w = d['weight']
    # u is the source, v the destination, w the weight
```

## 0.4. Create a zero-initialized dictionary from a set

Suppose you want to create a dictionary `d` whose keys are the elements of set `s` and whose values are all zero:

```python
d = {element: 0 for element in s}
```

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

# 1. Read mapping of codes to country names

Read into variable ``id2name`` the file containing the list of countries and territories.

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

In [1]:
import csv
import gzip
import io
import networkx as nx
import matplotlib.pyplot as plt
import math
import pandas as pd

In [2]:
INPUT_NAMES_FILENAME = "trade-countries.csv.gz"
INPUT_TRADE_1980 = "trade_1980-flows.csv.gz"
INPUT_TRADE_2013 = "trade_2013-flows.csv.gz"

# Organisation for Economic Co-operation and Development
OECD = set(["AUS", "AUT", "BEL", "CAN", "CHL", "COL", "CZE", "DNK", "EST", "FIN", "FRA",
        "DEU", "GRC", "HUN", "ISL", "IRL", "ISR", "ITA", "JPN", "KOR", "LVA", "LTU",
        "LUX", "MEX", "NLD", "NOR", "NZL", "POL", "CHE", "ESP", "GBR", "PRT", "SVK",
        "SVN", "SWE", "TUR", "USA"])

# Brazil, Russia, India and China
BRIC = set(["BRA", "RUS", "IND", "CHN"])

# COUNTRIES
COUNTRIES = OECD.union(BRIC)

In [3]:
id2name = {}

<font size="+1" color="red">Replace this cell with your code to read country names into id2name.</font>

# 2. Read flows data into two graphs

Read the graphs as directed graphs into variables `g1980` (1980 data) and `g2013` (2013 data).

In some rows the `amount` is empty. Hence, you will have to consider those as zeroes. You can do, for instance: `amount = float(record["amount"]) if len(record["amount"]) > 0 else 0.0`. Divide the amounts by one million and round to the nearest integer, so the weights will be expressed in millions of dollars.
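The conversion described above can be sketched as a small helper. The name `to_millions` is hypothetical and not part of the assignment; the first sample value is taken from the example rows in this section.

```python
def to_millions(raw_amount):
    """Convert a raw dollar amount (string) to an integer number of millions.

    Hypothetical helper: an empty string is treated as zero.
    """
    amount = float(raw_amount) if len(raw_amount) > 0 else 0.0
    return int(round(amount / 1_000_000))


print(to_millions("3221525000"))  # 3222
print(to_millions(""))            # 0
print(to_millions("400000"))      # 0
```

Python's `round` rounds exact halves to the nearest even integer, so an amount of exactly 500000 becomes 0 and is treated as a zero flow.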

Add only the edges that describe a non-zero trade amount (i.e., more than half a million dollars) and whose two countries both belong to `COUNTRIES`. It does not matter whether a country is in the OECD or in the BRIC group, but it must belong to one of them to be included in the graph.

Notice the column that indicates the flow, which is either "Import" or "Export":

* If the flow is "Import", add an edge *from country2 to country1*.
* If the flow is "Export", add an edge *from country1 to country2*.

In most cases the same flow is reported twice, with inconsistent amounts, for instance:

```
Import,USA,ITA,USD,3221525000
Export,ITA,USA,USD,2992289762
```

These numbers are different because they are reported by different countries, probably using different accounting methods. The best solution is to take the average of both numbers.

For that, you can use `g.has_edge(source, dest)` to check if an edge already exists, `g.get_edge_data(source, dest)["weight"]` (or, equivalently, `g[source][dest]["weight"]`) to obtain the attribute "weight" of an existing edge, `g.remove_edge(source, dest)` to remove the edge, and `g.add_edge(source, dest, weight=x)` to add an edge from *source* to *dest* with weight *x*.

You can assume each edge appears at most twice, or keep two labels per edge: *sum* and *count*, and then create a new label with the *sum/count* for every edge after reading the graph.
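The *sum* and *count* variant could look like the sketch below, shown on a toy list of reports rather than on the real files. The two ITA to USA amounts are the example rows above converted to millions; the USA to FRA row is made up. The attribute names `sum` and `count` are one possible choice, not a requirement.

```python
import networkx as nx

# (source, destination, amount in millions). The first two rows come from
# the example above; the third row is made up for illustration.
reports = [("ITA", "USA", 3222), ("ITA", "USA", 2992), ("USA", "FRA", 10)]

g_avg = nx.DiGraph()
for source, dest, amount in reports:
    if g_avg.has_edge(source, dest):
        # Second report of the same flow: accumulate it
        g_avg[source][dest]["sum"] += amount
        g_avg[source][dest]["count"] += 1
    else:
        g_avg.add_edge(source, dest, sum=amount, count=1)

# After reading everything, store the average as the edge weight
for source, dest, d in g_avg.edges(data=True):
    d["weight"] = d["sum"] / d["count"]

print(g_avg["ITA"]["USA"]["weight"])
```

This version also works if a flow is reported more than twice, which the "at most twice" shortcut does not handle.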

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

<font size="+1" color="red">Replace this cell with your code to read the two graphs. Read the 1980 trade graph into variable g1980, read the 2013 graph into variable g2013.</font>

Draw the two graphs using NetworkX. Define an auxiliary function named `plotGraph` for this, and use it to plot both graphs.

You can adapt the following code snippet. If `g` is a NetworkX graph whose edge weights are stored in the attribute *weight*, the snippet lays out the graph using a spring model. The variable *EDGE_WIDTH_MULTIPLIER* should be a positive number smaller than one, used to reduce the edge widths to a manageable size (set it by trial and error).

```python
plt.figure(figsize=(20,10))

# Layout the graph using a spring model
pos = nx.spring_layout(g, iterations=100, weight="weight")

# Draw the nodes on the screen
nx.draw_networkx_nodes(g, pos, node_size=700)

# Add labels to the nodes
nx.draw_networkx_labels(g, pos)

# Create an array with edge widths
edgewidth = []
for u, v, d in g.edges(data=True):
    weight = math.log(d['weight'])*EDGE_WIDTH_MULTIPLIER
    edgewidth.append(weight)

# Use the edgewidth array to draw the edges
_ = nx.draw_networkx_edges(g, pos, width=edgewidth)
```

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

<font size="+1" color="red">Replace this cell with your code to plot the two graphs.</font>

<font size="+1" color="red">Replace this cell with a brief commentary on the similarities and differences you observe between these two graphs.</font>

# 3. Compute total imports and exports


Compute total imports and total exports per country. Store total imports in a dictionary `total_imports` and total exports in a dictionary `total_exports`, both keyed by country code.
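One way to sanity-check your own function is to compare it against NetworkX's weighted degrees on a small graph. The sketch below uses made-up flows; the names `g_check`, `exports_check` and `imports_check` are hypothetical. It assumes, as stated in this notebook, that an edge from *u* to *v* means *u* exports to *v*.

```python
import networkx as nx

# Toy graph with made-up weights
g_check = nx.DiGraph()
g_check.add_edge("ITA", "USA", weight=3)
g_check.add_edge("USA", "ITA", weight=2)
g_check.add_edge("ITA", "FRA", weight=5)

# Weighted out-degree = total exports, weighted in-degree = total imports
exports_check = dict(g_check.out_degree(weight="weight"))
imports_check = dict(g_check.in_degree(weight="weight"))

print(exports_check)
print(imports_check)
```

Countries that do not appear in a graph get no entry here, so your own dictionaries still need a zero for every country in `COUNTRIES`.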

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

<font size="+1" color="red">Replace this cell with a function named "totals" that should return the "total_imports" and "total_exports" for a graph. Remember in the graph an edge from u to v means u exports towards v.</font>

Print these dictionaries using the code provided below. It converts the data to a [Pandas dataframe](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html) and relies on pandas' display functions.

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

In [8]:
sorted_countries = sorted(COUNTRIES)

flowsData = {
    'exp1980': [exports1980[c] for c in sorted_countries],
    'imp1980': [imports1980[c] for c in sorted_countries],
    'exp2013': [exports2013[c] for c in sorted_countries], 
    'imp2013': [imports2013[c] for c in sorted_countries],
}

flowsDF = pd.DataFrame(flowsData, index=sorted_countries)
flowsDF

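| country | exp1980 | imp1980 | exp2013 | imp2013 |
|---|---:|---:|---:|---:|
| AUS | 1253 | 1273 | 16590 | 13357 |
| AUT | 1173 | 1678 | 12276 | 12992 |
| BEL | 4331 | 4876 | 34348 | 33877 |
| BRA | 1078 | 873 | 8069 | 9519 |
| CAN | 4825 | 4303 | 36483 | 34554 |
| CHE | 2089 | 2806 | 16231 | 16651 |
| CHL | 259 | 246 | 5519 | 5150 |
| CHN | 715 | 1100 | 100868 | 49216 |
| COL | 294 | 291 | 2859 | 2787 |
| CZE | 0 | 0 | 12441 | 11042 |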


# 4. Compute hubs and authorities

Implement the iterative algorithm seen in class for hubs and authorities. Instead of vectors, we will use two dictionaries having country names as keys. Dictionary `h` should contain the hub scores, while dictionary `a` should contain the authority scores.

Start from normalized hub scores: the hub score should be *1/N* for each country, where *N* is the number of countries.

Then, perform 50 iterations of the following:

1. Compute authority scores from hub scores; do not normalize the edge weights.
1. Normalize the authority scores.
1. Compute hub scores from authority scores; do not normalize the edge weights.
1. Normalize the hub scores.
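
In symbols, one iteration of the four steps above can be sketched as follows. This assumes the weighted form of the update, with $w_{uv}$ denoting the weight of the edge from $u$ to $v$ and $E$ the set of edges; the notation is an assumption and may differ from the one used in class.

$$
\begin{aligned}
\hat{a}(v) &= \sum_{u \,:\, (u,v) \in E} w_{uv}\, h(u), & a(v) &= \frac{\hat{a}(v)}{\sum_{x} \hat{a}(x)},\\
\hat{h}(u) &= \sum_{v \,:\, (u,v) \in E} w_{uv}\, a(v), & h(u) &= \frac{\hat{h}(u)}{\sum_{x} \hat{h}(x)}.
\end{aligned}
$$

After each normalization the scores sum to one, which keeps them comparable across iterations and across the two years.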

Create two functions: `normalize(d)` that sums the values of a dictionary and then divides each one by the sum, returning the resulting dictionary, and `hubs_authorities(g, total_exports, total_imports)` that computes hubs and authorities.

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

<font size="+1" color="red">Replace this cell with your code for functions "normalize" and "hubs_authorities". Remember to include comments to explain what your code does at every relevant step.</font>

In [10]:
print("Computing for 1980")
(h1980,a1980) = hubs_authorities(g1980, exports1980, imports1980)

print("Computing for 2013")
(h2013,a2013) = hubs_authorities(g2013, exports2013, imports2013)

Computing for 1980
Computing for 2013


In [11]:
# Add these columns to your data frame
flowsDF['h1980'] = pd.Series(h1980)
flowsDF['a1980'] = pd.Series(a1980)
flowsDF['h2013'] = pd.Series(h2013)
flowsDF['a2013'] = pd.Series(a2013)

flowsDF

| country | exp1980 | imp1980 | exp2013 | imp2013 | h1980 | a1980 | h2013 | a2013 |
|---|---:|---:|---:|---:|---:|---:|---:|---:|
| AUS | 1253 | 1273 | 16590 | 13357 | 0.013263 | 0.016626 | 0.012859 | 0.022212 |
| AUT | 1173 | 1678 | 12276 | 12992 | 0.014044 | 0.021715 | 0.009531 | 0.013785 |
| BEL | 4331 | 4876 | 34348 | 33877 | 0.057163 | 0.06305 | 0.026143 | 0.028081 |
| BRA | 1078 | 873 | 8069 | 9519 | 0.014295 | 0.01368 | 0.013169 | 0.008886 |
| CAN | 4825 | 4303 | 36483 | 34554 | 0.078239 | 0.082594 | 0.126571 | 0.046513 |
| CHE | 2089 | 2806 | 16231 | 16651 | 0.0249 | 0.034353 | 0.017678 | 0.017384 |
| CHL | 259 | 246 | 5519 | 5150 | 0.003467 | 0.003885 | 0.006134 | 0.00821 |
| CHN | 715 | 1100 | 100868 | 49216 | 0.008397 | 0.015265 | 0.196016 | 0.045361 |
| COL | 294 | 291 | 2859 | 2787 | 0.004093 | 0.004696 | 0.008168 | 0.003081 |
| CZE | 0 | 0 | 12441 | 11042 | | | 0.007471 | 0.012058 |


Print the top 5 countries by exports, imports, hubs, and authorities in 2013 (that is, print 4 lists).

You can use the following command: `display(flowsDF.sort_values(by=colname, ascending=False).head(5))`

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

<font size="+1" color="red">Replace this cell with commands to print the four lists above.</font>

<font size="+1" color="red">Replace this cell with a brief commentary in which you compare the four lists above.</font>

# 5. Comparison of hub/export, authority/import scores

Now, we will compare the hub score of a country against its exports, and the authority scores of a country against its imports.

We can do this visually by plotting both in log-log scale. You can use the following code snippet, which assumes we are plotting dictionary *a* against dictionary *b*:

```python
# Create log-log plot
plt.figure(figsize=(20,10))
plt.loglog()

# Add a diagonal line
plt.plot([min(a.values()),max(a.values())], [min(b.values()),max(b.values())], '-.', lw=2)

# Do the scatter plot with texts
for country in set(a.keys()).intersection(set(b.keys())):
    plt.text(a[country], b[country], country)
```

Remember to add a title, as well as labels to the x axis and y axis before delivering your plots, and to use a function to draw your plots: do not duplicate code.

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

<font size="+1" color="red">Replace this cell with a function to create a scatter plot. Remember to include a title as well as labels for the x axis and y axis.</font>

<font size="+1" color="red">Replace this cell with two scatterplots for 1980: one comparing hub scores against exports, and the other comparing authority scores against imports.</font>

<font size="+1" color="red">Replace this cell with a brief commentary in which you compare the two plots above.</font>

<font size="+1" color="red">Replace this cell with two scatterplots for 2013: one comparing hub scores against exports, and the other comparing authority scores against imports.</font>

<font size="+1" color="red">Replace this cell with a brief commentary in which you compare the two plots above.</font>

<font size="+1" color="red">Replace this cell with a brief commentary with some overall conclusions.</font>

# Deliver (individually)

A .zip file containing:

* This notebook.


## Extra points available

If you would like to go for extra points (+2, so your maximum grade can be a 12 in this assignment), export this network to .csv and import it into Cytoscape. Then, do a clustering analysis in Cytoscape, color the clusters, and insert the image in this Notebook with a brief commentary.

**Note:** if you go for the extra points, add ``<font size="+2" color="blue">Additional results: country clusters</font>`` at the top of your notebook.

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

<font size="+2" color="#003300">I hereby declare that, except for the code provided by the course instructors, all of my code, report, and figures were produced by myself.</font>