# Remarks:

- Save your adjacency matrix under a descriptive name in the subfolder "<b><i>./Data/Adjacencies/</i></b>" and update the list in the next cell.
- Only overwrite an existing file if you are sure of what you are replacing it with.
- If you change the imports, make sure the rest of the notebook still runs.
- Please check your results.



### Adjacencies available:
- **adjacency_hyperlinks**: built from every category, with links based on hyperlinks. **This matrix is directed, and that is expected!** If you need an undirected version, symmetrise it and save it in a separate csv.
- ...
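
If an undirected version is needed, the symmetrisation mentioned above could look like the following minimal sketch. The toy matrix and the output file name in the comment are assumptions made for illustration only.

```python
import numpy as np

# Toy directed adjacency (hypothetical values): 0 -> 1 and 2 -> 0.
directed = np.array([[0, 1, 0],
                     [0, 0, 0],
                     [1, 0, 0]])

# Keep a link between i and j if it exists in either direction.
symmetric = np.maximum(directed, directed.T)

# To save it next to the directed version (hypothetical file name):
# pd.DataFrame(symmetric).to_csv(
#     './Data/Adjacencies/adjacency_hyperlinks_symmetric.csv', index=False)
```

Taking the element-wise maximum keeps the matrix binary, whereas `directed + directed.T` would give a weight of 2 to reciprocal links.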

### Useful functions:

From Task 1:
- sparsify_adjacency(adjacency, All_Nodes, epsilon, seed): keeps a random fraction epsilon of the nodes and returns the reduced adjacency matrix together with the list of remaining nodes and their categories.

- assign_values(All_Nodes, assign_val): assigns a value to each node according to its category, based on assign_val. The first entry is for 'player', the second for 'Country' and the last for 'National Team'.

From Task 2:
- ...



## Install

...

# Imports:

In [None]:
import numpy as np
import pandas as pd
import networkx as nx
import matplotlib.pyplot as plt
import copy

from pygsp import graphs, filters, reduction, plotting
from scipy import sparse

%matplotlib inline

# Part I
# Structure of the Graph with Hyperlink Connections between sites

## Import Data

In [None]:
Nodes_Linked = pd.read_csv("./Data/Nodes_Linked.csv", sep='\t', encoding= 'utf-16')
All_Nodes = pd.read_csv("./Data/All_Nodes.csv", sep='\t', encoding= 'utf-16')


## Make Adjacency (great again)

If you are not going to change the adjacency, you can skip this section and load the saved file instead.

In [None]:
All_Nodes.reset_index(level=0, inplace=True)
All_Nodes = All_Nodes.rename(columns={'index':'node_idx'})


# Create a conversion table from name to node index.
name2idx = All_Nodes[['node_idx', 'Node']]
name2idx = name2idx.set_index('Node')

Nodes_Linked = Nodes_Linked.join(name2idx, on='Nodes')
Nodes_Linked = Nodes_Linked.join(name2idx, on='Links', rsuffix='_target')
Nodes_Linked_Full = Nodes_Linked.copy(deep=True)
Nodes_Linked = Nodes_Linked.drop(columns=['Nodes', 'Links', 'Node_Category'])
Nodes_Linked['node_idx'] = Nodes_Linked['node_idx'].astype(int)
Nodes_Linked['node_idx_target'] = Nodes_Linked['node_idx_target'].astype(int)

<b>Check whether any value is NaN!</b>

In [None]:
Nodes_Linked.isnull().any().any()
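
Note that `astype(int)` raises an error when a column contains NaN, so in practice the check is most useful before the cast. The sketch below reproduces the join on toy data (all names and values are hypothetical) to show how an unmatched link target produces a NaN and how such rows could be dropped before casting.

```python
import pandas as pd

# Toy stand-ins for All_Nodes and Nodes_Linked (hypothetical values).
all_nodes = pd.DataFrame({"Node": ["A", "B", "C"]})
all_nodes = all_nodes.reset_index().rename(columns={"index": "node_idx"})
links = pd.DataFrame({"Nodes": ["A", "B", "C"],
                      "Links": ["B", "C", "Z"]})  # "Z" is not a known node

# Same name -> index lookup as in the notebook.
name2idx = all_nodes.set_index("Node")[["node_idx"]]
links = links.join(name2idx, on="Nodes")
links = links.join(name2idx, on="Links", rsuffix="_target")

# Unmatched names show up as NaN in the joined index columns.
has_missing = links[["node_idx", "node_idx_target"]].isnull().any().any()

# Drop unmatched rows first, then cast: astype(int) would fail on NaN.
clean = links.dropna(subset=["node_idx", "node_idx_target"]).copy()
clean["node_idx"] = clean["node_idx"].astype(int)
clean["node_idx_target"] = clean["node_idx_target"].astype(int)
```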

<b>Great! Now build the Adjacency Matrix</b>:

In [None]:
n_nodes = len(All_Nodes)
print("Number of nodes ", n_nodes)
adjacency = np.zeros((n_nodes, n_nodes), dtype=int)
for idx, row in Nodes_Linked.iterrows():
    if np.isnan(row.node_idx_target):
        continue
    i, j = int(row.node_idx), int(row.node_idx_target)
    adjacency[i, j] = 1
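
The row-by-row loop above can be slow on large link tables. As a possible alternative, NumPy integer-array indexing fills the same entries in a single assignment. This is a sketch on toy arrays (hypothetical values); in the notebook, `sources` and `targets` would be the `node_idx` and `node_idx_target` columns.

```python
import numpy as np

n_nodes = 4
# Toy link list (hypothetical): includes a duplicate (3 -> 0) and a self-link (2 -> 2).
sources = np.array([0, 0, 2, 3, 3])
targets = np.array([1, 2, 2, 0, 0])

adjacency = np.zeros((n_nodes, n_nodes), dtype=int)
# Row index = source node, column index = target node, as in the loop version.
adjacency[sources, targets] = 1
```

Duplicate links simply write 1 twice, so the matrix stays binary, just as with the loop.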

<b>Set Diagonal to 0</b>:

In [None]:
# Count the self-loops, then remove them.
Sum = np.trace(adjacency)
np.fill_diagonal(adjacency, 0)
print("Sum of values on the diagonal was " +str(Sum)+". Now it's 0.")

**Display:**

In [None]:
fig = plt.figure(figsize = (15,8))
ax1 = fig.add_subplot(1,2,1)
ax1.spy(adjacency, markersize=1)
ax1.set_title('Adjacency Matrix')
ax2= fig.add_subplot(1,2,2)
ax2.spy(adjacency[700:, 700:], markersize=1)
ax2.set_title('Adjacency Matrix Zoomed on [700:,700:]')

plt.show()

print("Diagonal on the right? Example adjacency(760, 792) = " + str(adjacency[760, 792]) +\
      ". Corresponds to link (" + str(All_Nodes.iloc[760,1])+"," + str(All_Nodes.iloc[792,1])+").")

We can clearly observe that the first 736 entries are players, connecting to just about anything. They are followed by the 32 countries taking part in the World Cup, which only connect to one another and to their respective national teams (though some other teams may appear in their sporting history because of a notable event; this is the case for Iceland, for example, as can be seen below). Finally, the national teams connect to everyone (and heavily to one another, since their match histories are reflected in the links).

In [None]:
Nodes_Linked_Full.iloc[[24642, 24643, 24651], :]
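
The block structure described above can also be quantified by counting the links between each pair of categories. The helper below is a hypothetical sketch, not part of the original notebook; it is shown on a toy matrix, and for the real adjacency the boundaries would be `[736, 768]`.

```python
import numpy as np

def block_link_counts(adjacency, boundaries):
    """Count links between consecutive groups of nodes.

    boundaries lists the first index of every group after the first one.
    Entry [a, b] of the result is the number of links from group a to group b.
    """
    edges = [0, *boundaries, adjacency.shape[0]]
    k = len(edges) - 1
    counts = np.zeros((k, k), dtype=int)
    for a in range(k):
        for b in range(k):
            block = adjacency[edges[a]:edges[a + 1], edges[b]:edges[b + 1]]
            counts[a, b] = block.sum()
    return counts

# Toy example (hypothetical): groups are {0, 1}, {2, 3} and {4}.
toy = np.zeros((5, 5), dtype=int)
for i, j in [(0, 4), (1, 4), (2, 3), (4, 0), (4, 2)]:
    toy[i, j] = 1
counts = block_link_counts(toy, [2, 4])
```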

## Save the Adjacency

In [None]:
if (0):
    df_adjacency = pd.DataFrame(adjacency)
    df_adjacency.to_csv('./Data/Adjacencies/adjacency_hyperlinks.csv', index=False)

## Let's make the matrix sparse and display the graph associated

Check that it is indeed connected. 

In [None]:
adjacency = pd.read_csv("./Data/Adjacencies/adjacency_hyperlinks.csv").values

In [None]:
adjacency_sparsed = sparse.csr_matrix(adjacency)

In [None]:
G = graphs.Graph(adjacency_sparsed)
print('{} nodes, {} edges'.format(G.N, G.Ne))

print('Connected: {}'.format(G.is_connected()))
print('Directed: {}'.format(G.is_directed()))
fig = plt.figure(figsize = (15,8))
plt.hist(G.d)
plt.title('Degree Distribution of the Graph')
plt.xlabel('Degree Value')
plt.ylabel('Number of nodes in that range')
plt.show()

#print("Maximum of " +str(G.d[783])+ " corresponds to " + str(All_Nodes.iloc[783, 1]) +".")

So we do have a connected and directed graph. Note that the average number of connections per node is quite high (27532/800 ≈ 34.42)! The maximum degree of 122 corresponds to the Croatian national football team (https://en.wikipedia.org/wiki/Croatia_national_football_team), which has an amazingly complete page.
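
For a directed graph, "connected" can mean weakly connected (connected once link directions are ignored) or strongly connected (every node reachable from every other along the link directions). Which of the two the PyGSP check above reports is not verified here. As an independent cross-check, SciPy can count both kinds of components; the sketch below uses a toy chain graph (hypothetical values).

```python
import numpy as np
from scipy import sparse
from scipy.sparse.csgraph import connected_components

# Toy directed chain 0 -> 1 -> 2 (hypothetical values).
chain = sparse.csr_matrix(np.array([[0, 1, 0],
                                    [0, 0, 1],
                                    [0, 0, 0]]))

# Number of components when link direction is ignored.
n_weak, _ = connected_components(chain, directed=True, connection="weak")
# Number of components when paths must follow link direction.
n_strong, _ = connected_components(chain, directed=True, connection="strong")
```

A graph is connected in the corresponding sense when the component count is 1. The chain above is weakly connected but not strongly connected.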

The next plot of the graph takes a good while. 

In [None]:
fig, ax = plt.subplots(figsize=(20, 20))
G.set_coordinates()
G.plot(backend='matplotlib', plot_name='Graph of Hyperlinks',
       show_edges=True, vertex_size=50, ax=ax, save_as="png")
if(0):
    plt.savefig('./Images/TASK1_Graph_Hyperlinks.png')
plt.show()

This is quite naturally very dense and hard to read. To observe the structure, let's <b>get a sparsified version of this graph</b> as governed by the parameter _epsilon_.

In [None]:
epsilon = 0.4
seed = 9087   # Fixed seed so that the results are reproducible

PyGSP can sparsify the graph with `G_Sparsified = reduction.graph_sparsify(G, epsilon)`, but that call does not indicate which nodes are left, so it is of no use here.

We therefore write our own function, which keeps a random fraction _epsilon_ of the nodes (the seed is fixed so that the result is reproducible):

In [None]:
def sparsify_adjacency(adjacency, All_Nodes, epsilon, seed):
    """
    Returns the sparsified adjacency as well as the list of nodes in it (with their categories)
    """
    nodes_local = All_Nodes.copy(deep=True)
    adjacency_local = copy.deepcopy(adjacency)
    
    n_nodes_initial = adjacency.shape[0]
    n_nodes_final = round(n_nodes_initial*epsilon)
    n_nodes_deleted= n_nodes_initial - n_nodes_final
    print("Will end up with "+ str(n_nodes_final)+ " nodes.")
    
    np.random.seed(seed)
    nodes_to_delete = np.random.choice(n_nodes_initial, n_nodes_deleted,replace=False)
    
    #Remove selected rows(and columns)
    adjacency_local = np.delete(adjacency_local, nodes_to_delete, axis=0)        
    adjacency_local = np.delete(adjacency_local, nodes_to_delete, axis=1)
    
    #Remove these from the list of nodes and their categories
    nodes_local = nodes_local.drop(nodes_local.index[nodes_to_delete])

    return adjacency_local, nodes_local


In [None]:
adjacency_sparsified, Nodes_Sparsified = sparsify_adjacency(adjacency, All_Nodes, epsilon, seed)
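
As a side note, the same node subsampling can be written with `np.ix_`, which also makes it easy to return the indices of the kept nodes. The function below is a hypothetical variant, not the notebook's `sparsify_adjacency`; it uses `np.random.default_rng`, so for a given seed it will not select the same nodes as the version above.

```python
import numpy as np

def subsample_nodes(adjacency, fraction, seed):
    """Keep a random fraction of the nodes; return the sub-matrix and kept indices."""
    rng = np.random.default_rng(seed)
    n = adjacency.shape[0]
    n_keep = round(n * fraction)
    kept = np.sort(rng.choice(n, size=n_keep, replace=False))
    # np.ix_ selects the rows and columns of the kept nodes in one step.
    return adjacency[np.ix_(kept, kept)], kept

# Toy 10-node adjacency (hypothetical values).
toy = (np.arange(100).reshape(10, 10) % 3 == 0).astype(int)
sub_a, kept_a = subsample_nodes(toy, 0.4, seed=9087)
sub_b, kept_b = subsample_nodes(toy, 0.4, seed=9087)  # same seed, same selection
```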

To plot the signal, we assign a value to each category from the list assign_val (first element for players, second for countries and last for national teams). The next function does this by creating a <b>new column <i>Category_bin</i></b> in a new dataframe.

In [None]:
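def assign_values(All_Nodes, assign_val):
    """
    Returns a copy of All_Nodes with a new column Category_bin holding the
    value assigned to each node's category (player, Country, National Team).
    """
    nodes_local = All_Nodes.copy(deep=True)
    nodes_local['Category_bin'] = ''
    size = nodes_local.shape[0]
    for i in range(size):
        # Positional indexing: the category is read from column 1 and the value
        # is written to column 2, which is Category_bin only for a two-column input.
        cat = nodes_local.iloc[i, 1]
        if cat == 'player':
            nodes_local.iloc[i, 2] = assign_val[0]
        elif cat == 'Country':
            nodes_local.iloc[i, 2] = assign_val[1]
        elif cat == 'National Team':
            nodes_local.iloc[i, 2] = assign_val[2]
    return nodes_local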

In [None]:
Nodes_Sparsified_binned = assign_values(Nodes_Sparsified, [0, 5, 10])
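
The same assignment can be done without an explicit loop by mapping category labels to values with `Series.map`. This is a sketch on toy data; the column name `Category` and the node names are assumptions, since the actual column names of `All_Nodes` are not shown here.

```python
import pandas as pd

# Toy node table (hypothetical names and column labels).
nodes = pd.DataFrame({
    "Node": ["Player A", "Country X", "Team X", "Player B"],
    "Category": ["player", "Country", "National Team", "player"],
})

assign_val = [0, 5, 10]
value_of = {"player": assign_val[0],
            "Country": assign_val[1],
            "National Team": assign_val[2]}

# Look up each node's category in the dictionary.
nodes["Category_bin"] = nodes["Category"].map(value_of)
```

Selecting the column by name avoids depending on column positions. A category missing from the dictionary would yield NaN rather than an empty string.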

In [None]:
adjacency_sparsified_sparsed = sparse.csr_matrix(adjacency_sparsified)
signal_category = Nodes_Sparsified_binned['Category_bin'].values

In [None]:
G_Sparsified = graphs.Graph(adjacency_sparsified_sparsed)
fig, ax = plt.subplots(figsize=(20, 10))
G_Sparsified.set_coordinates()
plotting.plot_signal(G_Sparsified, signal_category, backend= 'matplotlib',ax = ax, \
                     plot_name = 'Graph of Hyperlinks Sparsed', vertex_size =100)
if(0):
    plt.savefig('./Images/TASK1_Graph_Hyperlinks_Sparsified.png')
plt.show()

What a wonderful mess! Still, a structure is quite visible. In yellow, one finds the _National Teams_. These are clearly at the heart of the network, connecting both the _Players_ (purple) and the _Countries_ (blue). Note how the countries form a connected set on the left.

#### Let's observe the degree distribution separately for each category.

We return to the full graph. Remember that there are 736 players, 32 countries and 32 teams, and that they appear in that order in the adjacency.

In [None]:
degree_dist_player_part = G.d[0:736]
degree_dist_countries_part = G.d[736:768]
degree_dist_teams_part = G.d[768:800]

In [None]:
fig = plt.figure(figsize = (15,5))
ax1 = fig.add_subplot(1,3,1)
ax1.hist(degree_dist_player_part, color = '#4d004d')
ax1.set_title('Degree Distribution of Players')
ax1.set_xlabel('Degree Value')
ax1.set_ylabel('Number of nodes in that range')
ax1.set_xlim(0, 140)

ax2 = fig.add_subplot(1,3,2)
ax2.hist(degree_dist_countries_part, color = '#009999')
ax2.set_title('Degree Distribution of Countries')
ax2.set_xlabel('Degree Value')
ax2.set_ylabel('Number of nodes in that range')
ax2.set_xlim(0, 140)

ax3 = fig.add_subplot(1,3,3)
ax3.hist(degree_dist_teams_part, color = '#e6e600')
ax3.set_title('Degree Distribution of National Teams')
ax3.set_xlabel('Degree Value')
ax3.set_ylabel('Number of nodes in that range')
ax3.set_xlim(0, 140)
if(0):
    plt.savefig('./Images/TASK1_Degree_Distributions.png')


plt.show()


Clearly, these categories of nodes do not behave in the same way. _Countries_ are very poorly connected to the rest of the nodes. They are, however, a very frequent link target, as one can imagine (most players point to their own country). _Players_ show intermediate behaviour: some are heavily connected (surely famous players<b>*</b>), but most have around 30 connections. Finally, _National Teams_ are the most heavily connected, since they link to most of their players, to some of the other national teams (due, for example, to match history) and to their country. Note how Croatia's national football team appears as an outlier in the bin centred on 120, as we saw earlier.

<b>*</b> For example, the players with a degree above 80 have the node labels [300, 455, 470, 506]. These correspond to:

In [None]:
All_Nodes.iloc[[300, 455, 470, 506], :]

So yes, quite famous players.
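
The remark above that countries link to few nodes but are a frequent link target can be checked by separating out-degree from in-degree. With the convention used when building the adjacency (row = source, column = target), these are simply row and column sums. A minimal sketch on a toy matrix (hypothetical values):

```python
import numpy as np

# Toy directed adjacency: nodes 0-2 play the role of players, node 3 of a country.
adjacency = np.array([[0, 1, 0, 1],
                      [1, 0, 0, 1],
                      [0, 0, 0, 1],
                      [0, 0, 0, 0]])

out_degree = adjacency.sum(axis=1)  # links leaving each node (row sums)
in_degree = adjacency.sum(axis=0)   # links arriving at each node (column sums)
```

Here the "country" node has an out-degree of 0 and an in-degree of 3. On the full adjacency, the same sums could be sliced with `[0:736]`, `[736:768]` and `[768:800]` to compare the three categories.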

## On this bombshell, this first part concludes. 

# Part 2

In [None]:
...