# Social Network Analysis - Python Hands-on

In [1]:
import numpy as np
import networkx as nx
import pandas as pd
import matplotlib.pyplot as plt

## The Data

The data set is provided by Der Standard, one of the top Austrian newspapers.
On the online edition of Der Standard, people can post comments below articles and up/down vote comments.
The data set used in this hands-on and later in the project part of the course is a sample of those articles, comments, and votes.

In [2]:
df = pd.read_csv('Sample_Articles_Int_081116_091116_Postings.csv', sep=';')

In [3]:
df.head()

Unnamed: 0,ID_CommunityIdentity,ID_Article,ID_Posting,ID_Posting_Parent,PostingHeadline,PostingComment,PostingCreatedAt,ArticleType,ArticleTitle,ArticleChannel,ArticleRessortName,UserCommunityName,UserGender,UserCreatedAt
0,19,2000047115078,1016194519,1016178000.0,Sockenpuppenstadt höchstwahrscheinlich.....,,2016-11-08 19:25:36,Artikel,Osteuropa fürchtet bei Trump-Sieg um Nato-Schu...,International,USA,abraxas17,m,2009-12-31 16:08:53
1,19,2000047167560,1016190436,,Die Goldsteins der Welt haben die Untersuchung...,DAS ist die einzige Erklärung dazu.,2016-11-08 17:12:35,Artikel,Wie das FBI 650.000 E-Mails in 691.200 Sekunde...,International,USA,abraxas17,m,2009-12-31 16:08:53
2,19,2000047115078,1016192030,1016178000.0,So dürfen sie das auch nicht sehen. Die Nato i...,Die Wahrheit: Europa und Russland wären eine S...,2016-11-08 18:01:54,Artikel,Osteuropa fürchtet bei Trump-Sieg um Nato-Schu...,International,USA,abraxas17,m,2009-12-31 16:08:53
3,19,2000047115078,1016194457,1016183000.0,"Das setzt aber voraus, das man lesen kann und ...",Das könnens doch ned von dieser Generation ver...,2016-11-08 19:23:30,Artikel,Osteuropa fürchtet bei Trump-Sieg um Nato-Schu...,International,USA,abraxas17,m,2009-12-31 16:08:53
4,19,2000047211042,1016237514,1016237000.0,"Eu, die Blattlinie hat mich gelöscht. Es darf ...",Das nennt man unabhängige Presse mit Bildungsa...,2016-11-09 13:40:15,Artikel,"Alter, Bildung, Geschlecht: Wer für Donald Tru...",International,US-Wahl 2016,abraxas17,m,2009-12-31 16:08:53


There are different entities in the data set: 
* Users - identified by *ID_CommunityIdentity* (or *UserCommunityName*)
* Postings - identified by *ID_Posting*
* Articles - identified by *ID_Article*

Thus, there are different possibilities to build networks based on the posting data. 
We will concentrate now on the ***reply-to-network***. 


In [4]:
votes = pd.read_csv('Sample_Articles_Int_081116_091116_Votes.csv', sep=';')
votes.head()

Unnamed: 0,ID_CommunityIdentity_v,ID_Posting,VoteNegative,VotePositive,VoteCreatedAt,CommunityName_v,Gender_v,UserCreatedAt_v
0,377,1016177389,0,1,2016-11-08 11:10:08,agentjohnson,m,2009-11-27 15:27:08
1,377,1016176190,0,1,2016-11-08 11:09:49,agentjohnson,m,2009-11-27 15:27:08
2,377,1016176083,0,1,2016-11-08 11:10:41,agentjohnson,m,2009-11-27 15:27:08
3,377,1016176007,0,1,2016-11-08 11:10:48,agentjohnson,m,2009-11-27 15:27:08
4,377,1016176650,0,1,2016-11-08 11:09:37,agentjohnson,m,2009-11-27 15:27:08


## Reply-To-Network

The two fundamental components of a network are *nodes* and *edges*. 
In the anticipated reply-to-network, nodes are the users (i.e., *ID_CommunityIdentity*). 
An edge between two nodes (i.e., users) is built when one user replies to a posting by another user. 

In [5]:
df[["ID_CommunityIdentity", "ID_Posting", "ID_Posting_Parent"]].head(15)

Unnamed: 0,ID_CommunityIdentity,ID_Posting,ID_Posting_Parent
0,19,1016194519,1016178000.0
1,19,1016190436,
2,19,1016192030,1016178000.0
3,19,1016194457,1016183000.0
4,19,1016237514,1016237000.0
5,19,1016187933,1016187000.0
6,19,1016245149,1016244000.0
7,19,1016188639,
8,19,1016239311,
9,19,1016248639,


Each line in the table above shows that a user (i.e., *ID_CommunityIdentity*) posted a comment. Every post has its own unique identifier (i.e., *ID_Posting*). If a user replies to a previous posting, the posting they are targeting is identified by *ID_Posting_Parent*. *NaN* shows that the posted comment is located at the root (i.e., it is not targeted towards any other comment). 
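As a small illustration of the *NaN* convention, the sketch below separates root postings from replies. The toy data frame and its IDs are made up for this example; only the column names are taken from the data set.

```python
import pandas as pd

# Hypothetical toy postings; the IDs are invented for illustration.
toy = pd.DataFrame({
    "ID_CommunityIdentity": [1, 2, 3, 1],
    "ID_Posting": [100, 101, 102, 103],
    "ID_Posting_Parent": [float("nan"), 100, 100, 102],
})

# A missing parent (NaN) marks a root posting; everything else is a reply.
is_root = toy["ID_Posting_Parent"].isna()
roots = toy[is_root]
replies = toy[~is_root]

print(len(roots), "root posting(s),", len(replies), "replies")
```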

We want to bring the structure above into the following format: 
* source, i.e., the replying user
* target, i.e., the targeted user
* weight, i.e., how often the source replied to the target

In other words, we are aiming for a *weighted edge-list*.
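To make the target format concrete, here is a minimal sketch on hypothetical toy data. It joins each reply to its parent posting with a pandas merge and counts the (source, target) pairs. The toy IDs and the helper names (`toy`, `parents`, `toy_edges`) are assumptions for this example, and this is only one possible way to get a weighted edge-list.

```python
import pandas as pd

# Hypothetical toy postings; only the column names come from the data set.
toy = pd.DataFrame({
    "ID_CommunityIdentity": [1, 2, 3, 2, 1],
    "ID_Posting": [100, 101, 102, 103, 104],
    "ID_Posting_Parent": [float("nan"), 100, 100, 100, 102],
})

# Keep only replies and make the parent ID an integer so it matches ID_Posting.
replies = toy.dropna(subset=["ID_Posting_Parent"]).astype({"ID_Posting_Parent": "int64"})

# Lookup table: posting ID -> author of that posting (the reply target).
parents = toy[["ID_Posting", "ID_CommunityIdentity"]].rename(
    columns={"ID_Posting": "ID_Posting_Parent", "ID_CommunityIdentity": "target"})

# The inner join drops replies whose parent posting is not part of the sample.
joined = replies.merge(parents, on="ID_Posting_Parent", how="inner")

# Count how often each source replied to each target.
toy_edges = (joined.rename(columns={"ID_CommunityIdentity": "source"})
             .groupby(["source", "target"]).size()
             .reset_index(name="weight"))
print(toy_edges)
```

In this toy example user 2 replies twice to user 1, so the pair (2, 1) gets weight 2.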

### Edges

In [6]:
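# For every reply, look up the author of the parent posting (the target).
edgeList = [
    [post.ID_CommunityIdentity, next(iter(df[df.ID_Posting == post.ID_Posting_Parent].ID_CommunityIdentity))] 
    for idx, post in df.iterrows()
    if not np.isnan(post.ID_Posting_Parent)]  # skip root postings (no parent)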

In [7]:
edgeList[:10]

[[19, 174710],
 [19, 48423],
 [19, 520799],
 [19, 601254],
 [19, 85349],
 [19, 564105],
 [19, 565489],
 [19, 213715],
 [19, 579244],
 [19, 227468]]

In [8]:
weightedEdgeList = [(edge[0],edge[1],edgeList.count(edge)) for edge in edgeList]
weightedEdgeList = list(set(weightedEdgeList))

In [9]:
weightedEdgeList[:10]

[(564597, 566270, 1),
 (579440, 247352, 1),
 (572502, 529976, 1),
 (72562, 172063, 1),
 (558263, 564781, 1),
 (506220, 37147, 1),
 (589704, 501174, 1),
 (189316, 574565, 1),
 (533194, 500128, 2),
 (585246, 531258, 1)]
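Calling `edgeList.count(edge)` rescans the whole list for every edge, which grows quadratically with the number of replies. A possible alternative, sketched here on a hypothetical toy list, is to count the pairs in one pass with `collections.Counter`. The pairs below are invented for illustration.

```python
from collections import Counter

# Hypothetical (source, target) pairs, in the same shape as edgeList.
toy_edge_list = [[19, 7], [19, 7], [3, 19], [7, 3], [19, 7]]

# Lists are not hashable, so each pair is converted to a tuple before counting.
counts = Counter(tuple(edge) for edge in toy_edge_list)

# One (source, target, weight) triple per distinct pair.
toy_weighted = [(source, target, weight) for (source, target), weight in counts.items()]
print(toy_weighted)
```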

In [10]:
edges = pd.DataFrame(weightedEdgeList, columns=['source','target','weight'])

In [11]:
edges.head()

Unnamed: 0,source,target,weight
0,564597,566270,1
1,579440,247352,1
2,572502,529976,1
3,72562,172063,1
4,558263,564781,1


In [12]:
edges.to_csv("reply_to_edges.csv", index=False)

### Graph

We use the *networkx* library.
Since we build a *reply-to-network*, we have *source* nodes and *target* nodes. 
Thus, the network is directed.
Therefore, we use *nx.DiGraph()*.

In [13]:
G = nx.from_pandas_edgelist(edges, 
                            source='source', 
                            target='target', 
                            edge_attr = 'weight',
                            create_using=nx.DiGraph())
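A quick sanity check after building the graph is to inspect a few basic properties. The sketch below does this on a hypothetical three-edge toy edge-list (the values are invented), using the same `from_pandas_edgelist` call as above.

```python
import networkx as nx
import pandas as pd

# Hypothetical weighted edge-list with the same columns as `edges`.
toy_edges = pd.DataFrame({"source": [1, 2, 3], "target": [3, 1, 1], "weight": [1, 2, 1]})

toy_G = nx.from_pandas_edgelist(toy_edges, source="source", target="target",
                                edge_attr="weight", create_using=nx.DiGraph())

print(toy_G.number_of_nodes(), toy_G.number_of_edges())
print(toy_G[2][1]["weight"])   # weight of the edge 2 -> 1
print(toy_G.has_edge(1, 2))    # the reverse direction is a different edge

# Weighted in-degree: total number of replies each user received.
in_strength = dict(toy_G.in_degree(weight="weight"))
print(in_strength)
```

Because the graph is directed, the edge 2 -> 1 exists while 1 -> 2 does not.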

In [14]:
"""
fig = plt.figure(figsize=(50,50))
nx.draw_spring(G)
plt.show()
"""

'\nfig = plt.figure(figsize=(50,50))\nnx.draw_spring(G)\nplt.show()\n'

## Link prediction

### Common neighbors (highest number)

In [145]:
def common_neighbors(G, pairs):
    # Return (u, v, score) triples, like the networkx link-prediction functions
    return [(u, v, len(list(nx.common_neighbors(G, u, v)))) for u, v in pairs]

In [143]:
def compute_similarity(G, method):
    similarity_matrix = np.zeros((len(G.nodes()), len(G.nodes())), dtype=float)

    match method:
        case "common_neighbors":
            method_func = common_neighbors
        case "jaccard":
            method_func = nx.jaccard_coefficient
        case "adamic_adar":
            method_func = nx.adamic_adar_index
        case "preferential_attachment":
            method_func = nx.preferential_attachment
        case "resource_allocation":
            method_func = nx.resource_allocation_index
        case _:
            raise ValueError("Unknown method")

    for i, n1 in enumerate(G.nodes()):
        if i % 500 == 0:
            print("i=",i, "len(G.nodes())=", len(G.nodes()))
    
        for j, n2 in enumerate(G.nodes()):
            if n1 != n2:
                similarity_matrix[i][j] = list(method_func(G, [(n1, n2)]))[0][2]
                
    return similarity_matrix
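To see what the individual scores measure, the following sketch evaluates them for a single node pair on a tiny undirected graph. The toy graph is an assumption made for this example. For a pair (u, v) with neighbor sets N(u) and N(v), the scores are: the number of common neighbors, Jaccard = |N(u) ∩ N(v)| / |N(u) ∪ N(v)|, Adamic-Adar = sum of 1 / log(deg(w)) over common neighbors w, resource allocation = sum of 1 / deg(w), and preferential attachment = deg(u) * deg(v).

```python
import math
import networkx as nx

# Hypothetical toy graph: N(1) = {3, 4}, N(2) = {3, 4, 5}.
toy_G = nx.Graph([(1, 3), (1, 4), (2, 3), (2, 4), (2, 5)])
pair = [(1, 2)]

# Each networkx function yields (u, v, score) triples for the requested pairs.
n_common = len(list(nx.common_neighbors(toy_G, 1, 2)))
jaccard = next(iter(nx.jaccard_coefficient(toy_G, pair)))[2]
adamic_adar = next(iter(nx.adamic_adar_index(toy_G, pair)))[2]
pref_attach = next(iter(nx.preferential_attachment(toy_G, pair)))[2]
res_alloc = next(iter(nx.resource_allocation_index(toy_G, pair)))[2]

print(n_common, jaccard, adamic_adar, pref_attach, res_alloc)
```

Nodes 3 and 4 are the two common neighbors and both have degree 2, so Jaccard is 2/3, resource allocation is 1/2 + 1/2, Adamic-Adar is 2 / log(2), and preferential attachment is 2 * 3.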

### Validation

In [144]:
def map_index_to_node(G, index):
    return list(G.nodes())[index]

def map_node_to_index(G, node):
    return list(G.nodes()).index(node)

def evaluate(n_to_cut, method):
    # Work on an undirected copy of the global graph G
    eval_G = G.to_undirected()
    # Draw n_to_cut distinct edge indices at random between 0 and len(eval_G.edges()) - 1
    all_edges = list(eval_G.edges())
    edge_idx = np.random.choice(len(all_edges), n_to_cut, replace=False)
    removed_edges = [all_edges[idx] for idx in edge_idx]

    # Remove the sampled edges from the graph
    eval_G.remove_edges_from(removed_edges)
    
    # Compute the similarity matrix on the reduced graph
    similarity_matrix = compute_similarity(eval_G, method)
    # Flatten the matrix into (index1, index2, similarity) triples
    flattened_matrix = [(i, j, similarity_matrix[i][j]) for i in range(len(similarity_matrix)) for j in range(len(similarity_matrix[0]))]
    
    # Sort the flattened matrix based on the values in descending order
    sorted_flat_matrix = sorted(flattened_matrix, key=lambda x: x[2], reverse=True)
    
    # Print where each removed edge ends up in the sorted list
    for edge in removed_edges:
        edge = (map_node_to_index(eval_G, edge[0]), map_node_to_index(eval_G, edge[1]))
        for i, (index1, index2, value) in enumerate(sorted_flat_matrix):
            if (index1, index2) == edge:
                print("edge=", edge, "index=", i, "value = ", sorted_flat_matrix[i])
        
    return 0

evaluate(10, "jaccard")


i= 0 len(G.nodes())= 1915
i= 500 len(G.nodes())= 1915
i= 1000 len(G.nodes())= 1915
i= 1500 len(G.nodes())= 1915
edge= (605, 606) index= 1196660 value =  (605, 606, 0.0)
edge= (198, 905) index= 24652 value =  (198, 905, 0.1111111111111111)
edge= (84, 612) index= 235745 value =  (84, 612, 0.0)
edge= (370, 371) index= 761426 value =  (370, 371, 0.0)
edge= (439, 1006) index= 889536 value =  (439, 1006, 0.0)
edge= (153, 1378) index= 363660 value =  (153, 1378, 0.0)
edge= (15, 79) index= 77130 value =  (15, 79, 0.023529411764705882)
edge= (30, 769) index= 140622 value =  (30, 769, 0.0)
edge= (388, 489) index= 48909 value =  (388, 489, 0.0625)
edge= (391, 758) index= 800155 value =  (391, 758, 0.0)


0
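The ranking above is taken over all ordered node pairs, including pairs that are still connected. A common variation, sketched below as an assumption and not as the notebook's method, ranks the removed edges only among the pairs that are non-adjacent in the reduced graph. The example uses the small built-in karate club graph and an arbitrary seed so that it runs quickly.

```python
import random
import networkx as nx

full_G = nx.karate_club_graph()   # small built-in example graph
rng = random.Random(0)            # fixed seed, chosen arbitrarily
held_out = rng.sample(list(full_G.edges()), 5)

train_G = full_G.copy()
train_G.remove_edges_from(held_out)

# Without an explicit pair list, networkx scores every non-adjacent pair.
scored = sorted(nx.jaccard_coefficient(train_G), key=lambda t: t[2], reverse=True)

# frozenset keys make the lookup independent of the order of u and v.
rank = {frozenset((u, v)): r for r, (u, v, _) in enumerate(scored, start=1)}

for u, v in held_out:
    print((u, v), "rank", rank[frozenset((u, v))], "of", len(scored))
```

A lower rank means the removed edge was scored closer to the top of the candidate list. Ties are broken by the order in which networkx yields the pairs.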

### Compare lists

In [40]:
top_10 = [top10_common_neigbhors, top10_common_neigbhors_prop, top10_jacard, top10_pref_attach, top10_academic_adar]
# Elements that appear in every top-10 list
set.intersection(*map(set, top_10))


set()
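An empty intersection across all lists does not say how much any two methods agree. As a sketch on hypothetical toy lists (the method names and values are invented), the pairwise overlap can be computed with `itertools.combinations`:

```python
from itertools import combinations

# Hypothetical top lists, one per method.
toy_lists = {
    "method_a": [1, 2, 3, 4],
    "method_b": [3, 4, 5, 6],
    "method_c": [7, 8, 1, 3],
}

# Elements shared by every list.
common_to_all = set.intersection(*map(set, toy_lists.values()))

# Number of shared elements for every pair of methods.
pairwise = {
    (a, b): len(set(toy_lists[a]) & set(toy_lists[b]))
    for a, b in combinations(toy_lists, 2)
}

print(common_to_all)
print(pairwise)
```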

In [41]:
# Sort each top-10 list by value; the loop prints the lists in their original order
top_10_sorted = []

for top in top_10:
    top_10_sorted.append(sorted(top))
    print(top)

[118796 130267 576477  90073 118809 119031 119068 647332 143704 151360]
[3337817 2745255 2318802 2059921 3509401 2969701   26670  202451 2670760
 2194458]
[3392874 2542034 2497001 1215964 2242306 3154018 2049712 2801635 3408450
 2060166]
[118798 143687 151315  57529 126452 118796  57512 118760 151347 118809]
[ 90073 130267 151347 118809 119031 576477 119068 647332 151360 143704]


In [63]:
def top_k_elements(matrix, k):
    # Flatten the 2D matrix and store the indices along with values
    flattened_matrix = [(i, j, matrix[i][j]) for i in range(len(matrix)) for j in range(len(matrix[0]))]
    
    # Sort the flattened matrix based on the values in descending order
    sorted_matrix = sorted(flattened_matrix, key=lambda x: x[2], reverse=True)
    
    # Select the top k elements
    top_k = sorted_matrix[:k]
    
    return top_k

# Example usage
matrix = [
    [5, 8, 3],
    [2, 7, 1],
    [9, 4, 6]
]

k = 4
result = top_k_elements(matrix, k)

print(f"Top {k} elements:")
for i, j, value in result:
    print(f"Index1: {i}, Index2: {j}, Value: {value}")


Top 4 elements:
Index1: 2, Index2: 0, Value: 9
Index1: 0, Index2: 1, Value: 8
Index1: 1, Index2: 1, Value: 7
Index1: 2, Index2: 2, Value: 6
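For a large similarity matrix, building and sorting a Python list of all entries is slow. A possible NumPy-based variant is sketched below. The function name `top_k_numpy` is an assumption for this example, and with tied values the order may differ from `top_k_elements`.

```python
import numpy as np

def top_k_numpy(matrix, k):
    arr = np.asarray(matrix)
    # Sort the flattened matrix ascending, reverse it, and keep the first k positions.
    flat_idx = np.argsort(arr, axis=None)[::-1][:k]
    # Convert flat positions back to (row, column) indices.
    rows, cols = np.unravel_index(flat_idx, arr.shape)
    return [(int(i), int(j), arr[i, j].item()) for i, j in zip(rows, cols)]

matrix = [
    [5, 8, 3],
    [2, 7, 1],
    [9, 4, 6]
]

result = top_k_numpy(matrix, 4)
print(result)
```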
