$\newcommand{\vect}[1]{\boldsymbol{#1}}$
# 4
# World Wide Web, Wikipedia, and Social Networks

## Introduction
- The WWW owes its success to the potential offered by the HyperText Markup Language (HTML).
- Documents and media can be linked to each other, creating a network of information.
- The WWW developed a series of addresses of the form "www.oup.com".
- The addresses form a hierarchical classification: an almost infinite series of documents can be mapped below each address.
- The mapping is stored in specific servers named "domain name servers (DNS)".
- Any new page had to be found and manually placed into this artificial taxonomy to appear in the list.
- This became more and more difficult as the number of pages exploded (where and how to find all new pages?) and as the content became more and more complex to classify (how to assess the category of a web page?).


## Data from various sources
### WWW
- The WWW is a classic example of big data.
- The largest coherent structure created by humans.
- It is therefore of the utmost importance to be able to handle these series of data, and whenever possible to consider properly defined subsets of them.
- We start exploring the web with a database from the University of Milan, Italy.
    - At http://law.di.unimi.it we can find information on this site;
    - http://law.di.unimi.it/datasets.php contains a series of data collected and stored in compressed form;
    - http://webgraph.di.unimi.it/ contains information about the Webgraph compressed graph format and instructions on how to extract it.
    - [Download ".eu" (http://law.di.unimi.it/webdata/eu-2005/)](http://law.di.unimi.it/webdata/eu-2005/)

In [None]:
%%time
# Code for loading the ".eu" portion of the WWW in 2005
# Modified by etc.

import networkx as nx

eu_DG = nx.read_edgelist('./data/eu-2005_1M.arcs', create_using=nx.DiGraph())

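# generate the dictionary of node ids -> urls (node i is line i of the file)
dic_nodid_urls = {}

with open("./data/eu-2005.urls") as f:
    for count, line in enumerate(f):
        # strip only the trailing newline, so a last line without one stays intact
        dic_nodid_urls[str(count)] = line.rstrip("\n")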

# extract the largest strongly connected component
scc = max(nx.strongly_connected_components(eu_DG), key=len)
eu_DG_SCC = eu_DG.subgraph(scc)
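
To make the notion of a strongly connected component concrete, here is a minimal sketch in plain Python that does not use NetworkX. It relies on the fact that the component of a page is the set of pages that can both be reached from it and reach it. The toy edge list and the helper names `reachable` and `scc_of` are illustrative assumptions, not part of the original notebook.

```python
from collections import deque


def reachable(adj, start):
    """Return the set of nodes reachable from `start` by following `adj`."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def scc_of(edges, start):
    """Strongly connected component of `start`: forward and backward reachability intersected."""
    forward, backward = {}, {}
    for u, v in edges:
        forward.setdefault(u, []).append(v)
        backward.setdefault(v, []).append(u)
    return reachable(forward, start) & reachable(backward, start)


# Hypothetical toy web: pages 0-2 link in a cycle, page 3 is only linked to,
# page 4 only links out.
toy_edges = [(0, 1), (1, 2), (2, 0), (2, 3), (4, 0)]
component = scc_of(toy_edges, 0)
print(sorted(component))
```

Pages 3 and 4 are left out because no path leads back from page 3 and no path leads to page 4.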

### Twitter
- Twitter and Facebook are two clear cases where networks help in measuring social relationships.
- "tweets"
- "retweets"
- "following"
    - Is not reciprocal (i.e. if A follows B, not necessarily does B follow A).
- Twitter APIs : https://dev.twitter.com/docs
- Python module : https://twython.readthedocs.org
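
Because "following" is not reciprocal, the natural model is a directed graph. As a small illustration, the sketch below measures which fraction of the "follows" links are returned. The user names and the helper `reciprocity` are made up for this example.

```python
def reciprocity(follows):
    """Fraction of directed 'A follows B' links for which B also follows A."""
    links = set(follows)
    if not links:
        return 0.0
    mutual = sum(1 for (a, b) in links if (b, a) in links)
    return mutual / len(links)


# Hypothetical follow relations: only alice and bob follow each other.
follows = [("alice", "bob"), ("bob", "alice"),
           ("carol", "alice"), ("carol", "bob")]
r = reciprocity(follows)
print(r)
```

A value of 1 would mean that every link is returned, as in an undirected friendship network.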

In [None]:
# Code for the opening of tweets with the API
from twython import Twython

APP_KEY='xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'
APP_SECRET='xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'
OAUTH_TOKEN='xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'
OAUTH_TOKEN_SECRET='xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'

twitter_connection=Twython(APP_KEY, APP_SECRET,
                           OAUTH_TOKEN, OAUTH_TOKEN_SECRET)

In [None]:
# How to get the timeline

res=twitter_connection.get_home_timeline()
for t in res[:5]:
    print('Tweet:', t['text'], 
          '-', t['user']['name'], '(@' + t['user']['screen_name'] + ')')
    print('Mentions:', end='')
    for m in t['entities']['user_mentions']:
        print(' ' + m['screen_name'], end='')
    print('\n')

In [None]:
# How to get user information
res = twitter_connection.show_user(screen_name='@BarackObama')
print(res)
print('Location: ', res['location'])
print('Number of followers: ', res['followers_count'])

In [None]:
# Retrieving tweets with the "search" function
res = twitter_connection.search(q='#ebola', count=2)
for t in res['statuses']:
    print("Tweet:", t['text'])

### Wikipedia
- The various pages are interconnected forming one of <u>the largest thematic subnetworks</u> of the WWW.
- Interest in this subset of WWW pages is based on a series of reasons.
    - Wikipedia is a <u>well defined subgraph</u> of the WWW; indeed it forms a <u>thematic subset</u>, thereby creating a <u>natural laboratory</u> for WWW studies.
    - Over time Wikipedia has developed in different languages, so that various subsets of Wikipedia of different sizes are now available. Furthermore, Wikipedia networks <u>allow us to test whether different cultures tend to organise web pages differently</u>.
    - <u>All information on the Wikipedia graph is available</u>, even its <u>growth history</u>, with a <u>time stamp</u> for any additions to the system.
    - Wikipedia pages tend (where possible) to cite other Wikipedia pages, so that the whole system is contained.
- It is then natural to ask whether the links connecting two pages (lemmas of the encyclopaedia) determine communities of concepts and ultimately define a bottom-up taxonomy of reciprocal concepts (as one would expect).
- Download - https://dumps.wikimedia.org
- We use a small portion of Wikipedia, the edition "[in Limba Sarda](https://sc.wikipedia.org/)" (Sardinian), which at the moment consists of about 4500 articles.
- The structure of the Pagelinks and Page table
    - https://www.mediawiki.org/wiki/Manual:Pagelinks_table
    - https://www.mediawiki.org/wiki/Manual:Page_table

In [None]:
# Opening the Wikipedia Sardinian dump
# Assumption: the low-level _mysql module shipped with the mysqlclient
# (MySQLdb) package, whose result rows contain bytes under Python 3.
from MySQLdb import _mysql

scwiki_db = _mysql.connect(host="localhost", user="XXXXX",
                           passwd="XXXXX", db="scwiki_db")


def dump_query(db, sql, path):
    """Run `sql` and write the two columns of every row to `path`."""
    db.query(sql)
    r = db.use_result()
    with open(path, 'w') as f:
        res = r.fetch_row()
        while res != ():
            first, second = (x.decode() if isinstance(x, bytes) else str(x)
                             for x in res[0])
            f.write(first + " " + second + "\n")
            res = r.fetch_row()


# extract the hyperlink information (source page id, target page id)
dump_query(scwiki_db, """SELECT pagelinks.pl_from, page.page_id
FROM page, pagelinks
WHERE page.page_title = pagelinks.pl_title""",
           "./data/scwiki_edgelist.dat")

# extract the title information
dump_query(scwiki_db, "SELECT page.page_id, page.page_title FROM page",
           "./data/scwiki_page_titles.dat")

#### Using SQLite3 (appended by etc.)
- The MySQL dumps were converted to SQLite with the following script:
    https://github.com/dumblob/mysql2sqlite
- SQL import (in Un\*x)

    <code>#./mysql2sqlite scwiki-20170620-page.sql | sqlite3 scwiki_db<br>#./mysql2sqlite scwiki-20170620-pagelinks.sql.sql | sqlite3 scwiki_db</code>

In [None]:
# Opening the Wikipedia Sardinian dump using SQLite3
# Modified by etc.
import sqlite3

scwiki_db = sqlite3.connect("data/scwiki_db")

# extract the hyperlinks information
sql = """
SELECT pagelinks.pl_from, page.page_id \
FROM page, pagelinks \
WHERE page.page_title = pagelinks.pl_title
"""
    
with open("./data/scwiki_edgelist.dat", 'w') as f:
    for r in scwiki_db.execute(sql):
#         print(str(r[0])+" "+str(r[1])+"\n")
        f.write(str(r[0]) + " " + str(r[1]) + "\n")    

# extract the title information
sql = """
SELECT page.page_id, page.page_title FROM page
"""
with open("./data/scwiki_page_titles.dat", 'w') as f:
    for r in scwiki_db.execute(sql):
#         print(str(r[0])+" "+str(r[1])+"\n")
        f.write(str(r[0]) + " " + str(r[1]) + "\n")
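
To see what the join between the two tables returns, the following sketch builds a toy in-memory SQLite database. The schema is reduced to the four columns used by the query (the real MediaWiki tables have many more), and the three pages and four links are invented for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE page (page_id INTEGER PRIMARY KEY, page_title TEXT);
CREATE TABLE pagelinks (pl_from INTEGER, pl_title TEXT);
INSERT INTO page VALUES (1, 'Sardigna'), (2, 'Casteddu'), (3, 'Nugoro');
INSERT INTO pagelinks VALUES (1, 'Casteddu'), (1, 'Nugoro'), (2, 'Sardigna'),
                             (3, 'Missing_page');
""")

# Each row of pagelinks stores the id of the source page and the *title* of
# the target; the join translates the target title into a page id.
edges = con.execute("""
    SELECT pagelinks.pl_from, page.page_id
    FROM page JOIN pagelinks ON page.page_title = pagelinks.pl_title
    ORDER BY pagelinks.pl_from, page.page_id
""").fetchall()
con.close()
print(edges)
```

The link from page 3 to `Missing_page` disappears from the edge list because no page with that title exists, so links to pages that have not been written yet are silently dropped.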

### Wikipedia taxonomy
- Since Wikipedia is a means of organising knowledge (Gonzaga et al., 2001), it is interesting to check whether the structures arising from different languages, and hence from different cultures, have some sort of universality.
- The network formed by articles and hyperlinks together could provide a <u>self-organized way</u> to gather Wikipedia articles into categories; a classification that is currently <u>created upon the agreement of the whole Wikipedia community</u>.
- The simplest way to create a taxonomy is by use of a tree in the shape of the Linnaean taxonomy of living organisms [(Linnaeus, 1735)](https://en.wikipedia.org/wiki/Linnaean_taxonomy).
- Such a <u>clean structure does not, unfortunately, fully apply to Wikipedia</u>.
- Articles and categories will not strictly form a perfect tree, since an article or a category may happen to be the offspring of more than one parent category.
- The taxonomy of articles is represented in this case as a <u>directed acyclic graph</u>.
- The taxonomy must be <u>considered only as a soft partition</u>, where the intersection between classes is not empty.
- In this case one deals with (so-called) fuzzy partitions.
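
A minimal sketch of such a taxonomy, assuming an invented category graph in which one item has two parents. It checks that the graph is acyclic (with Kahn's algorithm) and that two classes overlap, which is what makes the partition soft.

```python
# Hypothetical category graph: parent -> children (categories or articles).
children = {
    "Science": ["Physics", "Biology"],
    "Physics": ["Biophysics", "Optics"],
    "Biology": ["Biophysics", "Botany"],
}


def descendants(node):
    """All nodes reachable from `node` by following parent -> child links."""
    found = set()
    stack = [node]
    while stack:
        for child in children.get(stack.pop(), []):
            if child not in found:
                found.add(child)
                stack.append(child)
    return found


def is_acyclic():
    """Kahn's algorithm: a DAG can be emptied by repeatedly removing parentless nodes."""
    nodes = set(children) | {c for cs in children.values() for c in cs}
    indegree = {n: 0 for n in nodes}
    for cs in children.values():
        for c in cs:
            indegree[c] += 1
    ready = [n for n in nodes if indegree[n] == 0]
    removed = 0
    while ready:
        n = ready.pop()
        removed += 1
        for c in children.get(n, []):
            indegree[c] -= 1
            if indegree[c] == 0:
                ready.append(c)
    return removed == len(nodes)


overlap = descendants("Physics") & descendants("Biology")
print(is_acyclic(), overlap)
```

The graph is not a tree because "Biophysics" has two parents, yet it contains no cycle; the non-empty `overlap` is the intersection between the two classes.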

## Bringing order to the WWW
- A short overview of the various methods that have been presented and made public <u>to infer the importance (centrality) of pages in the WWW</u>.
- Define the importance of a page <u>only topologically</u>, i.e. <u>without entering into semantic analysis</u> of the content of a single page.

### HITS (Hyperlink-Induced Topic Search) algorithm
- by [Kleinberg, (Hubs, Authorities, and Communities, 1999)](http://cs.brown.edu/memex/ACM_HypertextTestbed/papers/10.html)
- ***Authorities*** i.e. pages that contain <u>relevant information</u> (train timetable, food recipes, formulas of algebra).
- ***Hubs*** i.e. pages that do not necessarily contain information, but (as with Yahoo! pages) <u>have links to pages where the information is stored</u>.
- Every page $i$ has both an authority score $au(i)$ and a hub score $h(i)$, that are computed via a mutual recursion.
- Define <u>the authority of one page as proportional to the sum of the hub scores of the pages pointing to it</u>.

$$au(i)\propto\sum_{j\rightarrow i}h(j)$$

- The hub score of one page is proportional to the sum of the authority scores of the pages <u>reached from the hub</u>,

$$h(i)\propto\sum_{i\rightarrow j}au(j)$$

- To ensure convergence of the above recursion, a good method is to normalise the values of $h(i)$ and $au(i)$ at every iteration so that $\sum^n_{i=1}h(i)=\sum^n_{i=1}au(i)=1$.
- Ref.
    - [What is the HITS (Hypertext Induced Topic Selection) algorithm? (in Korean)](http://mrseo.co.kr/hitshypertext-induced-topic-selection-알고리즘이란/)
    - [HITS algorithm@Wikipedia](https://en.wikipedia.org/wiki/HITS_algorithm)

In [None]:
%matplotlib inline

In [None]:
# HITS algorithm
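def HITS_algorithm(DG, steps=1000):
    """Return the (authority, hub) score dictionaries after `steps` iterations."""
    auth = {}
    hub = {}

    for n in DG.nodes():
        auth[n] = 1.0
        hub[n] = 1.0

    for _ in range(steps):
        # authority of a page: sum of the hub scores of its predecessors
        norm = 0.0
        for n in DG.nodes():
            auth[n] = 0.0
            for p in DG.predecessors(n):
                auth[n] += hub[p]
            norm += auth[n]**2.0
        norm = norm**0.5
        for n in DG.nodes():
            auth[n] = auth[n] / norm

        # hub score of a page: sum of the authority scores of its successors
        norm = 0.0
        for n in DG.nodes():
            hub[n] = 0.0
            for s in DG.successors(n):
                hub[n] += auth[s]
            norm += hub[n]**2.0
        norm = norm**0.5
        for n in DG.nodes():
            hub[n] = hub[n] / norm

    # return only once all the iterations have been carried out
    return auth, hub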
 

DG = nx.DiGraph()
DG.add_edges_from([('A', 'B'), ('B', 'C'), ('A', 'D'),
                   ('D', 'B'), ('C', 'D'), ('C', 'A')])

nx.draw(DG, with_labels=True)

(auth, hub) = HITS_algorithm(DG)

print(auth)
print(hub)
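
The cell above rescales the scores by their Euclidean norm, whereas the text normalises them so that they sum to 1. The two choices give the same ranking; the sketch below, a plain-Python variant written for this note (the function name `hits_sum_normalised` is an assumption), uses the normalisation of the text on the same toy graph.

```python
def hits_sum_normalised(edges, steps=100):
    """HITS with the scores rescaled to sum to 1 after every update."""
    nodes = sorted({n for edge in edges for n in edge})
    auth = {n: 1.0 / len(nodes) for n in nodes}
    hub = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(steps):
        # authority: sum of the hub scores of the pages pointing to the page
        new_auth = {n: 0.0 for n in nodes}
        for src, dst in edges:
            new_auth[dst] += hub[src]
        total = sum(new_auth.values())
        auth = {n: v / total for n, v in new_auth.items()}
        # hub: sum of the authority scores of the pages the hub points to
        new_hub = {n: 0.0 for n in nodes}
        for src, dst in edges:
            new_hub[src] += auth[dst]
        total = sum(new_hub.values())
        hub = {n: v / total for n, v in new_hub.items()}
    return auth, hub


edges = [('A', 'B'), ('B', 'C'), ('A', 'D'),
         ('D', 'B'), ('C', 'D'), ('C', 'A')]
auth, hub = hits_sum_normalised(edges)
print(auth)
print(hub)
```

On this graph page D collects the highest authority, followed by B and A, while the authority of C shrinks towards zero as the iteration proceeds.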

### Spectral properties
- This method can be (qualitatively, not considering the normalisation problems) described by means of linear algebra.
- A graph can be equivalently represented by means of a matrix of numbers, that is, with its adjacency matrix.

<img src="./Fig.4.1.png" width=450>
<center><font size=-1>[A simple oriented graph with its adjacency matrix]</font></center>

- The equations giving rise to the hub score and to the authority score become

$$h(i)\propto\sum_{i\rightarrow j}au(j)\rightarrow h(i)\propto\sum^n_{j=1}a_{ij}\,au(j)\rightarrow\vec{h}\propto A\,\vec{au}$$


$$au(i)\propto\sum_{j\rightarrow i}h(j)\rightarrow au(i)\propto\sum^n_{j=1}a^T_{ij}h(j)\rightarrow\vec{au}\propto A^T\vec{h}$$
where $a^T_{ij}$ are the elements of the matrix $A^T$, the transpose of $A$ (that is, $a^T_{ij}=a_{ji}$).

In [None]:
# How to transpose and multiply a matrix
def matrix_transpose(M):
    M_out=[]
    for c in range(len(M[0])):
        M_out.append([])
        for r in range(len(M)):
            M_out[c].append(M[r][c])
    
    return M_out


def matrix_multiplication(M1, M2):
    M_out=[]
    
    for r in range(len(M1)):
        M_out.append([])
        for j in range(len(M2[0])):
            e = 0.0
            for i in range(len(M1[r])):
                e += M1[r][i] * M2[i][j]
            M_out[r].append(e)
    return M_out

adjacency_matrix1=[[0, 1, 0, 1],
                   [1, 0, 1, 1],
                   [0, 1, 0, 1]]

adjacency_matrix2 = matrix_transpose(adjacency_matrix1)

print("Transpose adjacency matrix:", adjacency_matrix2)

res_mul = matrix_multiplication(adjacency_matrix1, adjacency_matrix2)
print("Matrix multiplication:", res_mul)

In [None]:
# How to transpose and multiply a matrix using NumPy
# Modified by etc. using NumPy
import numpy as np

adjacency_matrix1=[[0, 1, 0, 1],
                   [1, 0, 1, 1],
                   [0, 1, 0, 1]]

adjacency_matrix2 = np.transpose(adjacency_matrix1)

print("Transpose adjacency matrix:", adjacency_matrix2)

res_mul = np.dot(adjacency_matrix1, adjacency_matrix2)
print("Matrix multiplication:", res_mul)

Combining the formulas above,

$$\vec{h}\propto AA^T\vec{h}\;\Rightarrow\;\vec{h}=\lambda_hAA^T\vec{h},$$
$$\vec{au}\propto A^TA\,\vec{au}\;\Rightarrow\;\vec{au}=\lambda_{au}A^TA\,\vec{au}.$$

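These are eigenvalue problems for the matrices $M\equiv AA^T$ (hub scores) and $A^TA$ (authority scores). The properties below are stated for $M$ and hold for $A^TA$ in the same way.
- $M$ is real and symmetric (it is equal to its own transpose), so its eigenvalues are real;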
- $M$ is non-negative (i.e. all its entries are greater than or equal to 0); if we can find a $k > 0$ such that $M^k \gg 0$, that is, all of the entries of $M^k$ are strictly larger than 0, then $M$ is *primitive*. If $M$ is a primitive matrix:
    * the largest eigenvalue $\lambda$ of $M$ is positive and of multiplicity 1;
    * every other eigenvalue of $M$ is in modulus strictly less than $\lambda$;
    * the largest eigenvalue $\lambda$ has a corresponding eigenvector with all entries positive.
- Being a primitive matrix means in physical terms that the graph defined by the adjacency matrix <u>must have no dangling ends or sinks</u> and that <u>it is possible to reach any page from any starting point</u>.
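
The definition of a primitive matrix can be checked directly on small examples by computing successive powers. The sketch below is an illustration written for this note, not part of the original notebook; it relies on Wielandt's bound, which states that for a primitive $n\times n$ matrix the power $k=(n-1)^2+1$ is already strictly positive, so no larger power needs to be tried.

```python
def mat_mul(X, Y):
    """Product of two square matrices given as lists of rows."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def is_primitive(M):
    """True if some power M^k (k >= 1) has only strictly positive entries."""
    n = len(M)
    # Wielandt's bound: for a primitive n x n matrix, k = (n-1)^2 + 1 suffices.
    max_power = (n - 1) ** 2 + 1
    power = [row[:] for row in M]
    for _ in range(max_power):
        if all(entry > 0 for row in power for entry in row):
            return True
        power = mat_mul(power, M)
    return False


# Connected graph containing a triangle: every page reaches every other one
# in exactly four steps, so the fourth power is strictly positive.
connected = [[0, 1, 0, 1],
             [1, 0, 1, 1],
             [0, 1, 0, 0],
             [1, 1, 0, 0]]
# Two pages linking only to each other: the powers alternate between this
# matrix and the identity, so a zero entry always survives.
two_cycle = [[0, 1],
             [1, 0]]

print(is_primitive(connected), is_primitive(two_cycle))
```

The second example shows that being able to reach any page from any other is not enough on its own: a strictly periodic structure also prevents the matrix from being primitive.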

In [None]:
# Principal eigenvalue/vector extraction (power iteration)
adjacency_matrix=[[0, 1, 0, 1],
                  [1, 0, 1, 1],
                  [0, 1, 0, 0],
                  [1, 1, 0, 0]]
vector=[[0.21], [0.34], [0.52], [0.49]]

for i in range(100):
    res = matrix_multiplication(adjacency_matrix, vector)
    norm_sq = 0.0
    for r in res:
        norm_sq = norm_sq + r[0] * r[0]
    
    vector = []
    for r in res:
        vector.append([r[0]/(norm_sq**0.5)])
        
print("Maximum eigenvalue (in absolute value):", norm_sq**0.5)
print("Eigenvector for the maximum eigenvalue:", vector)

In [None]:
# Modified by etc. using NumPy
import numpy as np
from numpy import linalg as LA

adjacency_matrix = np.array([[0, 1, 0, 1],
                             [1, 0, 1, 1],
                             [0, 1, 0, 0],
                             [1, 1, 0, 0]])

# eig() returns the eigenvalues in no particular order, so the one that is
# largest in absolute value has to be selected explicitly
w, v = LA.eig(adjacency_matrix)
idx = np.argmax(np.abs(w))
print(w[idx])
print(v[:, idx])

In [None]:
# HITS algorithm for the ".eu" domain in 2005
import operator

auth, hub = HITS_algorithm(eu_DG_SCC)
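# sort in descending order so that the first entries are the top scores
sorted_auth = sorted(auth.items(), key=operator.itemgetter(1), reverse=True)
sorted_hub = sorted(hub.items(), key=operator.itemgetter(1), reverse=True)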

print("Top 5 auth")
for p in sorted_auth[:5]:
    print(dic_nodid_urls[p[0]], p[1])
                         
print("\nTop 5 hub")
for p in sorted_hub[:5]:
    print(dic_nodid_urls[p[0]], p[1])

### PageRank
- The most successful measure of eigenvector centrality is given by another algorithm.
- Give <u>only one score to the pages of the web</u>, irrespective of its role as authority or hub.
- The values of PageRank for the various pages in the graph are given by the eigenvector <mark>$\vect{r}$</mark> , related to the largest eigenvalue $\lambda_1$ of the matrix $\vect{P}$ , given by
$$\vect{P}=\alpha\vect{N}+(1-\alpha)\vect{E};$$
the weight is taken as $\alpha = 0.85$ in the original paper [(Page et al., 1999)](http://infolab.stanford.edu/pub/papers/google.pdf). $\vect{N}$ is the normalised matrix $\vect{N} = \vect{A}\vect{K}_o^{-1}$, where $\vect{A}$ is the adjacency matrix and $\vect{K}_o^{-1}$ is the diagonal matrix whose diagonal entries are given by the inverse of the out-degree, $(\vect{K}_o^{-1})_{ii}=1/k^o_i$. The matrix $\vect{E}$ is not defined in these notes; in the standard formulation it is the matrix with all entries equal to $1/n$, where $n$ is the number of pages.
- This new matrix $\vect{P}$ does not differ considerably from the original one $\vect{N}$ , but has the advantage that (thanks to its irreducibility) its eigenvectors can be computed by a simple iteration procedure [Langville and Meyer (2003)](https://projecteuclid.org/download/pdf_1/euclid.im/1109190965).
<img src="./Fig.4.2.png" width=250>
<center><font size=-1>[A simple case of a reducible matrix]</font></center>
- When the matrix is irreducible, a mathematical theorem (by Perron and Frobenius) ensures that the associated Markov chain has a unique and positive stationary vector $\vect{r}^\infty$ (Perron, 1907; Frobenius, 1912).
- Ref
    - http://ilpubs.stanford.edu:8090/422/1/1999-66.pdf
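
As an illustration of the matrix form, the sketch below builds $\vect{P}=\alpha\vect{N}+(1-\alpha)\vect{E}$ explicitly for a four-page graph and iterates it. Two conventions are assumptions of this example rather than statements from the notes: $\vect{E}$ has all entries $1/n$, and $\vect{N}$ is stored so that column $j$ spreads the score of page $j$ evenly over the pages it links to. Every page is assumed to have at least one outgoing link.

```python
alpha = 0.85
edges = [(1, 2), (2, 3), (3, 4), (3, 1), (4, 2)]
nodes = sorted({n for e in edges for n in e})
index = {node: i for i, node in enumerate(nodes)}
n = len(nodes)

out_degree = [0] * n
for src, _ in edges:
    out_degree[index[src]] += 1

# N[i][j] = 1 / k_out(j) if page j links to page i, and 0 otherwise.
N = [[0.0] * n for _ in range(n)]
for src, dst in edges:
    N[index[dst]][index[src]] = 1.0 / out_degree[index[src]]

# P = alpha * N + (1 - alpha) * E, with E assumed to have all entries 1/n.
P = [[alpha * N[i][j] + (1.0 - alpha) / n for j in range(n)] for i in range(n)]

# Power iteration: repeatedly apply P to a start vector that sums to 1.
r = [1.0 / n] * n
for _ in range(200):
    r = [sum(P[i][j] * r[j] for j in range(n)) for i in range(n)]

ranks = dict(zip(nodes, r))
print(ranks)
```

Because every column of `P` sums to 1, the total score is conserved at each step, and the iteration settles on a vector that `P` leaves unchanged. On this graph page 2 obtains the highest score, closely followed by page 3.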

In [None]:
# Compute the PageRank
def pagerank(graph, damping_factor=0.85, max_iterations=100,
             min_delta=0.00000001):
    nodes = graph.nodes()
    graph_size = len(nodes)
    if graph_size == 0:
        return {}
        
    # initialize the PageRank dict with 1/N for all nodes
    pagerank = dict.fromkeys(nodes, 1.0/graph_size)
    min_value = (1.0 - damping_factor)/graph_size
    
    for i in range(max_iterations):
        diff = 0  # total difference compared to the last iteration
        # compute each node's PageRank based on its inbound links
        for node in nodes:
            rank = min_value
            for referring_page in graph.predecessors(node):
                # the rank of the referring page is shared among its outgoing links
                rank += damping_factor * pagerank[referring_page]/graph.out_degree(referring_page)
            diff += abs(pagerank[node] - rank)
            pagerank[node] = rank
            
        # stop if PageRank has converged
        if diff < min_delta:
            break
    return pagerank

In [None]:
# PageRank for a test network
G = nx.DiGraph()
G.add_edges_from([(1, 2), (2, 3), (3, 4), (3, 1), (4, 2)])

nx.draw(G, with_labels=True)

# our PageRank algorithm
res_pr = pagerank(G, max_iterations=10000, min_delta=0.00000001, damping_factor=0.85)
print(res_pr)

# NetworkX PageRank function
print(nx.pagerank(G, max_iter=10000))


<img src="Fig.4.3.png" width=500>
<center><font size=-1>[This is the procedure to generate a network starting from a flux of tweets. The nodes are the Twitter users and each time one of them mentions, retweets or replies to another user a link is drawn from the first to the second. The weight of a link is the number of citations between the two.]</font></center> 
-  In this case a link is drawn from a user “A" towards a user “B" if the user “A" mentions user “B" in one of their tweets

In [None]:
# Generate and plot the Twitter mention network
def generate_network(list_mentions):
    DG = nx.DiGraph()
    for l in list_mentions:
        if len(l) < 2:
            continue
        for n in l[1:]:
            if not DG.has_edge(l[0], n):
                DG.add_edge(l[0], n, weight=1.0)
            else:
                DG[l[0]][n]['weight'] += 1.0
    return DG

# extracting user and mentions for each tweet
# res = twitter_connection.search(q='#FutureDecoded', count=5000)
res = twitter_connection.search(q='#GSL', count=5000)
# the first will be the tweet user
list_users={}
list_mentions=[]
for t in res['statuses']:
    list_unique_ids=[]
    print("User Screen Name and ID:", (t['user']['screen_name'], 
t['user']['id_str']))
    list_unique_ids.append(t['user']['id_str'])
    if t['user']['id_str'] not in list_users:
        list_users[t['user']['id_str']]=t['user']['screen_name']
    print("List of Mentions:", end='')
    for m in t['entities']['user_mentions']:
        if m['id_str'] != t['user']['id_str']:
            list_unique_ids.append(m['id_str'])
            if m['id_str'] not in list_users:
                list_users[m['id_str']] = m['screen_name']
        print((m['screen_name'], m['id_str']) , end='')
    print("\r")
    print(list_unique_ids)
    list_mentions.append(list_unique_ids)
    print("\n")

net_mentions = generate_network(list_mentions)

In [None]:
pos=nx.nx_pydot.graphviz_layout(net_mentions, prog='neato')
nx.draw(net_mentions, pos, node_size=50, node_color='Black')
# savefig('./data/hashtag_discussion_thread.png', dpi=600)

In [None]:
# Top PageRanks on a Twitter generated network (influencers)
import operator

pr = nx.pagerank(net_mentions, max_iter=10000)
sorted_pr = sorted(pr.items(), key=operator.itemgetter(1), reverse=True)

for page in sorted_pr[:10]:
    print(list_users[page[0]], page[1])

# Communities and Girvan-Newman algorithm
- The concept of a community is not in itself extremely precise; consequently there are many methods for detecting communities in networks, and they refer to slightly different objects.

## Girvan-Newman (GN) algorithm
- Based on a recursive deletion of edges.
- Edges are selected for their <u>bridging properties</u>, that is to say they are selected if they connect <u>dense regions</u> and therefore after their removal these dense regions appear as the communities within the system.
- The quantity chosen for this procedure is the <u>edge betweenness</u>.
- We start by removing the edge with the largest betweenness; we then recompute the edge betweenness and delete the edge with the largest value among those left.
- The process is repeated until all the edges are removed.
<img src="./Fig.4.5.png" width=500>
<font size=-1>[(left) A toy graph to which we applied the GN algorithm. First we compute the edge betweenness and then we cut the edge with the largest value (dashed). Recursively, we compute and again delete all the edges one after another. Whenever the removal of one edge splits the graph,we indicate (right) the edge in bold (i.e. edges E-F, A-D, G-L, D-I, B-E, A-B, H-I, F-L, C-G). As a result we obtain the dendrogram on the right.]</font>
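
To show why edge betweenness singles out bridges, here is a plain-Python sketch of the standard Brandes procedure for unweighted, undirected graphs. It is an illustration written for this note: the graph (two triangles joined by one edge) and the function name `edge_betweenness` are assumptions, and the scores are raw counts of shortest paths, not the normalised values that NetworkX returns by default.

```python
from collections import deque


def edge_betweenness(adj):
    """Return {frozenset({u, v}): number of shortest paths using the edge}."""
    score = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for s in adj:
        # breadth-first search from s, counting shortest paths (sigma)
        dist = {s: 0}
        sigma = {v: 0.0 for v in adj}
        sigma[s] = 1.0
        preds = {v: [] for v in adj}
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # walk back from the farthest nodes, accumulating path fractions
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                c = sigma[v] / sigma[w] * (1.0 + delta[w])
                score[frozenset((v, w))] += c
                delta[v] += c
    # every pair of end points was counted once from each side
    return {edge: value / 2.0 for edge, value in score.items()}


# Hypothetical graph: triangles A-B-C and D-E-F joined by the bridge C-D.
adj = {
    'A': ['B', 'C'], 'B': ['A', 'C'], 'C': ['A', 'B', 'D'],
    'D': ['C', 'E', 'F'], 'E': ['D', 'F'], 'F': ['D', 'E'],
}
scores = edge_betweenness(adj)
bridge = max(scores, key=scores.get)
print(sorted(bridge), scores[bridge])
```

All nine shortest paths between the two triangles pass through the bridge C-D, so it has the largest betweenness and is the first edge that the GN procedure removes, which splits the graph into the two triangles.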

In [None]:
# Code for the GN algorithm
import operator

import matplotlib.pyplot as plt
import networkx as nx

G = nx.Graph()
G.add_edges_from([('A', 'B'), ('A', 'D'), ('B', 'D'), ('B', 'E'), ('E', 'I'),
                  ('D', 'I'), ('D', 'H'), ('H', 'I'), ('E', 'F'), ('F', 'C'),
                  ('F', 'L'), ('C', 'L'), ('C', 'G'), ('G', 'L')])
cnt = 1
plt.figure(cnt)
pos = nx.nx_pydot.graphviz_layout(G, prog='neato')
nx.draw(G, pos, with_labels=True)

actual_number_components = 1

# repeatedly delete the edge with the largest betweenness until none is left
while G.number_of_edges() > 0:
    d_edge = nx.edge_betweenness_centrality(G)
    sorted_bc = sorted(d_edge.items(), key=operator.itemgetter(1))
    e = sorted_bc.pop()
    print("deleting edge:", e[0], end='')
    G.remove_edge(*e[0])
#     #
#     cnt += 1
#     plt.figure(cnt)
#     nx.draw(G, pos, with_labels=True)
#     #
    num_comp = nx.number_connected_components(G)
    print("...we have now ", num_comp, " components")
    if num_comp > actual_number_components:
        actual_number_components = num_comp

# Modularity
- Cutting bridging edges does not by itself tell us when one division is better than another.
- We need a quantity for assessing how good a division is, and therefore when we should stop.
- Steps:
    - the starting point is to consider a partition of the graph into g subgraphs;
    - if the partition is good most of the edges will be inside the subgraphs and few will connect them;
    - we then define a $g\times g$ matrix $E$ whose entries $e_{ij}$ give the fraction of edges that in the original graph connect subgraph $i$ to subgraph $j$;
    - the actual fraction of edges in subgraph $i$ is given by element $e_{ii}$;
    - the quantity $f_i=\sum_{j=1}^{g}e_{ij}$ gives the probability that an end-vertex of a randomly extracted edge is in subgraph $i$ ($i\in\{1,\dots,g\}$);
    - in the absence of correlations the probability that an edge belongs to subgraph $i$ is $f^2_i$.
- Define the modularity $Q$,
$$Q=\sum^g_{i=1}\left(e_{ii}-f^2_i\right)$$
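
The steps above can be sketched in a few lines of plain Python. One convention is an assumption of this example: an edge joining two different subgraphs contributes half of its weight to $e_{ij}$ and half to $e_{ji}$, so that the matrix $E$ is symmetric and its entries sum to 1. The function name `modularity` and the toy graph are likewise illustrative.

```python
def modularity(edges, community_of):
    """Q = sum_i (e_ii - f_i^2) for an undirected graph and a node -> community map."""
    labels = sorted(set(community_of.values()))
    pos = {label: k for k, label in enumerate(labels)}
    g = len(labels)
    m = len(edges)
    e = [[0.0] * g for _ in range(g)]
    for u, v in edges:
        i, j = pos[community_of[u]], pos[community_of[v]]
        if i == j:
            e[i][i] += 1.0 / m
        else:
            # assumption: an edge between two subgraphs is shared half and half
            e[i][j] += 0.5 / m
            e[j][i] += 0.5 / m
    f = [sum(row) for row in e]
    return sum(e[i][i] - f[i] ** 2 for i in range(g))


# Hypothetical graph: two triangles joined by a single edge (7 edges in total).
edges = [('A', 'B'), ('A', 'C'), ('B', 'C'),
         ('D', 'E'), ('D', 'F'), ('E', 'F'),
         ('C', 'D')]
two_groups = {'A': 0, 'B': 0, 'C': 0, 'D': 1, 'E': 1, 'F': 1}
one_group = {node: 0 for node in two_groups}

q_split = modularity(edges, two_groups)
q_whole = modularity(edges, one_group)
print(q_split, q_whole)
```

For the split into the two triangles, $e_{11}=e_{22}=3/7$ and $f_1=f_2=1/2$, so $Q=2\,(3/7-1/4)=5/14\approx 0.357$, while keeping the whole graph as a single community gives $Q=0$.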

In [None]:
# Community detection with the Karate Club network 
import community
import networkx as nx
import matplotlib.pyplot as plt

G = nx.read_edgelist("./data/karate.dat")

# first compute the best partition
partition = community.best_partition(G)

size = float(len(set(partition.values())))
pos = nx.spring_layout(G)
count = 0.
plt.axis('off')
for com in set(partition.values()):
    count = count + 1.
    list_nodes = [nodes for nodes in partition.keys() if partition[nodes] == com]
    # one grey level per community
    nx.draw_networkx_nodes(G, pos, list_nodes, node_size=300,
                           node_color=str(count/size))

# the labels only need to be drawn once, after all the nodes
nx.draw_networkx_labels(G, pos)
    
nx.draw_networkx_edges(G, pos, alpha=0.5, width=1)
# savefig('./data/karate_community.png', dpi=600)

In [None]:
%%time
# Community detection for the scwiki web graph (by etc)
scwiki_pagelinks_net_dir = nx.read_edgelist("./data/scwiki_edgelist.dat",
                                            create_using=nx.DiGraph())
scwiki_pagelinks_net = nx.read_edgelist("./data/scwiki_edgelist.dat")

diz_titles={}
with open("./data/scwiki_page_titles.dat", 'r') as f:
    for line in f:
        print(line.split()[0], line.split()[1])
        diz_titles[line.split()[0]]=line.split()[1]

- The problem in plotting this network is that it comprises almost 10,000 nodes.
- <u>Generate a representative network</u> in which each node is a community (we consider just the first nine with more than 200 nodes), with size proportional to the number of nodes in the corresponding community and edge weight proportional to the number of edges between each pair of communities (we cut the link below the threshold weight 100).
- The representative node is chosen <u>according to the Pagerank</u> inside the corresponding community.
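
The aggregation step described above can be sketched without NetworkX by scanning the edge list once, which avoids comparing every pair of nodes. The function name `community_graph`, its parameters, and the toy data are assumptions made for this illustration; communities are kept when they have at least `min_size` members and links when their weight exceeds `min_weight`.

```python
from collections import Counter


def community_graph(edges, partition, min_size=1, min_weight=0):
    """Collapse a graph into one node per community.

    Returns (sizes, weights): community -> number of member nodes, and
    (community_a, community_b) -> number of edges running between them.
    """
    sizes = Counter(partition.values())
    kept = {c for c, size in sizes.items() if size >= min_size}
    weights = Counter()
    for u, v in edges:
        cu, cv = partition[u], partition[v]
        if cu != cv and cu in kept and cv in kept:
            weights[tuple(sorted((cu, cv)))] += 1
    strong = {pair: w for pair, w in weights.items() if w > min_weight}
    return {c: sizes[c] for c in kept}, strong


# Hypothetical partition of six nodes into three communities.
partition = {'a': 0, 'b': 0, 'c': 0, 'd': 1, 'e': 1, 'f': 2}
edges = [('a', 'b'), ('b', 'c'), ('c', 'd'),
         ('a', 'd'), ('d', 'e'), ('e', 'f')]
sizes, weights = community_graph(edges, partition, min_size=2, min_weight=1)
print(sizes, weights)
```

Community 2 is discarded because it has a single member, and the two edges between communities 0 and 1 survive the weight threshold. In the setting of the text the corresponding thresholds would be 200 nodes and a weight of 100.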

In [None]:
# Generate and optimise the representative network of the community structure

#optimization
partition = community.best_partition(scwiki_pagelinks_net)

# Generate representative nodes of the community structure
community_structure = nx.Graph()
diz_communities={}
diz_node_labels={}
diz_node_sizes={}
max_node_size = 0
for com in set(partition.values()):
    diz_communities[com] = [nodes for nodes in partition.keys() 
                            if partition[nodes] == com]
    if len(diz_communities[com]) >= 200:
        if max_node_size < len(diz_communities[com]):
            max_node_size = len(diz_communities[com])
        print("community", com, len(diz_communities[com]), end='')
        sub_scwiki_dir = scwiki_pagelinks_net_dir.subgraph(diz_communities[com])
        res_pr = nx.pagerank(sub_scwiki_dir, max_iter=10000)
        sorted_pr = sorted(res_pr.items(), key=operator.itemgetter(1), reverse=True)
        print(diz_titles[sorted_pr[0][0]], sorted_pr[0][1])
        community_structure.add_node(com)
        diz_node_labels[com] = diz_titles[sorted_pr[0][0]]
        diz_node_sizes[com] = len(diz_communities[com])
        
# Generate edge weights according to the number of links among communities
max_edge_weight = 0.0
# list() gives positional access; indexing nodes() directly looks up a node by name
com_list = list(community_structure.nodes())
for i1 in range(len(com_list) - 1):
    for i2 in range(i1 + 1, len(com_list)):
        wweight = 0.0
        for n1 in diz_communities[com_list[i1]]:
            for n2 in diz_communities[com_list[i2]]:
                if scwiki_pagelinks_net.has_edge(n1, n2):
                    wweight = wweight + 1.0

        if wweight > 100.0:
            if max_edge_weight < wweight:
                max_edge_weight = wweight
            community_structure.add_edge(com_list[i1], com_list[i2],
                                         weight=wweight)

In [None]:
# Plotting the representative network of the community structure
import matplotlib.pyplot as plt

pos = nx.nx_pydot.graphviz_layout(community_structure, prog='circo')
node_size_factor = 2000.0
edge_weight_factor = 10.0

plt.axis('off')

for n in community_structure.nodes():
    nx.draw_networkx_nodes(community_structure, pos, [n], 
                           node_size=node_size_factor*diz_node_sizes[n]/max_node_size,
                           node_color='Black')

# label each community with the title of its top-PageRank page
nx.draw_networkx_labels(community_structure, pos, labels=diz_node_labels,
                        font_color='White')
    
for e in community_structure.edges():
    nx.draw_networkx_edges(community_structure, pos, [e], alpha=0.5,
                           width=edge_weight_factor*community_structure[e[0]][e[1]]['weight']/max_edge_weight)
    