## Import needed libraries

In [9]:
import networkx as nx
import pandas as pd

from collections import Counter

# Votes network
This notebook computes assortative mixing coefficients for the networks built using votes as edges.
To study positive and negative votes separately, we create two networks, a like network and a dislike network. In particular:
- users are the nodes
- in the like network, an edge is created when one user (source) votes positively on a posting by another user (target)
- in the dislike network, an edge is created when one user (source) votes negatively on a posting by another user (target)

We are building a directed and weighted network, in which weights are given by the number of positive/negative votes from source to target.
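
To make the difference between a directed and an undirected construction concrete, here is a minimal sketch on a toy vote list. The user IDs and the names `toy_votes`, `directed` and `undirected` are hypothetical and not part of the dataset.

```python
from collections import Counter

import networkx as nx

# Hypothetical (source, target) vote pairs; a repeated pair means repeated votes
toy_votes = [(1, 2), (1, 2), (2, 1), (3, 2)]

# Count how often each ordered pair occurs
vote_counts = Counter(toy_votes)

# A DiGraph keeps (1, 2) and (2, 1) as two separate weighted edges
directed = nx.DiGraph()
directed.add_weighted_edges_from((u, v, w) for (u, v), w in vote_counts.items())

# A Graph merges both directions into a single edge
undirected = nx.Graph()
undirected.add_edges_from(toy_votes)

print(directed.number_of_edges(), undirected.number_of_edges())
```

In the directed graph, the two votes from user 1 to user 2 and the single vote back are stored separately, while the undirected graph keeps only one edge between the two users.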

## Read the data
The Votes data only contains the ID of the voting user (the source node), so we take the ID of the posting's author (the target node) from the Postings data.
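
The join on `ID_Posting` can be sketched on toy data as follows. The frames `toy_votes` and `toy_postings` and their values are hypothetical. The `validate` argument is an optional safeguard, added here on the assumption that each posting ID appears only once in the Postings data.

```python
import pandas as pd

# Hypothetical votes: posting 99 has no matching row in the postings table
toy_votes = pd.DataFrame({"ID_Posting": [10, 10, 11, 99], "source": [1, 2, 1, 3]})
toy_postings = pd.DataFrame({"ID_Posting": [10, 11], "target": [7, 8]})

# Inner join: every vote gets the author of the posting it refers to.
# validate="one_to_many" raises if a posting ID is duplicated on the left side.
merged = pd.merge(
    toy_postings, toy_votes, on="ID_Posting", how="inner", validate="one_to_many"
)

# Votes whose posting is missing are silently dropped by an inner join
unmatched = ~toy_votes["ID_Posting"].isin(toy_postings["ID_Posting"])

print(merged[["source", "target"]])
print(int(unmatched.sum()))
```

Counting unmatched votes before merging shows how many rows the inner join would discard.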

In [10]:
votes1 = pd.read_csv("data/Votes_01052019_15052019.csv", sep = ';')
votes2 = pd.read_csv("data/Votes_16052019_31052019.csv", sep = ';')

postings1_df = pd.read_csv("data/Postings_01052019_15052019.csv", sep=';')
postings2_df = pd.read_csv("data/Postings_16052019_31052019.csv", sep=';')

In [11]:
# Concatenate both periods into a single dataframe and rename the ID_CommunityIdentity column
votes_df = pd.concat([votes1, votes2]).rename(columns={'ID_CommunityIdentity': 'source'})

# Convert the account creation date to the account age in years (reference year: 2024)
votes_df['UserCreatedAt'] = 2024 - (pd.to_datetime(votes_df['UserCreatedAt']).dt.year)
votes_df = votes_df.rename(columns={'UserCreatedAt': 'AccountAge'})

votes_df.head()

Unnamed: 0,source,ID_Posting,VoteNegative,VotePositive,VoteCreatedAt,UserCommunityName,UserGender,AccountAge
0,675862,1041076570,1,0,2019-05-06 16:47:46.883,Heckscheibenwischer,m,6
1,689023,1041076570,1,0,2019-05-01 22:19:06.240,Heinz Fettleber,,5
2,24810,1041076745,0,1,2019-05-01 23:54:54.600,Bruce Campbell,m,13
3,673781,1041076745,0,1,2019-05-01 20:59:29.910,Erdäpfelsack,,6
4,24810,1041076831,0,1,2019-05-01 23:51:42.730,Bruce Campbell,m,13


In [12]:
# Let us see how many unique users we have in the dataset - they will be the nodes of the networks
print(len(votes_df))
print(len(votes_df['source'].unique()))

3824984
29933


## Nodes attributes
Next, we create a separate dataframe that stores the attributes of each user; these become the node attributes in the networks.

For this analysis we are interested in:
- gender of the user
- age of the account

In [13]:
# Extract the node attributes into a separate pandas DataFrame, assuming that one node is one user
df_attributes = votes_df[['source', 'UserGender', 'AccountAge']].drop_duplicates()
len(df_attributes)

29933

## Merge the postings and votes datasets

In [14]:
postings = pd.concat([postings1_df, postings2_df])
postings = postings[['ID_Posting', 'ID_CommunityIdentity', 'ArticleChannel']].rename(columns={'ID_CommunityIdentity': 'target'})
postings.head()

Unnamed: 0,ID_Posting,target,ArticleChannel
0,1041073586,671476,Inland
1,1041073839,566938,Inland
2,1041073872,669286,Inland
3,1041080734,671476,Inland
4,1041080828,671476,Inland


In [15]:
finalvotes_df = pd.merge(postings, votes_df, on='ID_Posting')
finalvotes_df.head()

Unnamed: 0,ID_Posting,target,ArticleChannel,source,VoteNegative,VotePositive,VoteCreatedAt,UserCommunityName,UserGender,AccountAge
0,1041073586,671476,Inland,572203,0,1,2019-05-01 18:46:37.437,Hollerbusch,w,9
1,1041073586,671476,Inland,206222,0,1,2019-05-01 18:38:54.227,Talerkogel,m,23
2,1041073586,671476,Inland,212488,0,1,2019-05-01 22:43:43.030,Chucho,m,13
3,1041080734,671476,Inland,581206,0,1,2019-05-01 23:31:04.717,wertekonservativ,m,8
4,1041080734,671476,Inland,25580,0,1,2019-05-01 23:30:32.150,CEEIT,m,18


In [16]:
# Split into likes and dislikes so that two networks can be created
finalvotes_df_like = finalvotes_df[finalvotes_df['VotePositive'] == 1]
finalvotes_df_dislike = finalvotes_df[finalvotes_df['VotePositive'] == 0]

## Filter data according to the Article Channel

Because of the large amount of data, we measure the assortative mixing coefficients on subgroups. The data is split by the article channel of the posting, and a dictionary of DataFrames stores the data for each channel.
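
The split into one DataFrame per channel can also be written as a dictionary comprehension over the grouped DataFrame. This is a sketch on toy data; `toy_df` and `toy_channel_dfs` are hypothetical names.

```python
import pandas as pd

# Hypothetical merged votes with a channel column
toy_df = pd.DataFrame(
    {
        "ArticleChannel": ["Sport", "Inland", "Sport"],
        "source": [1, 2, 3],
        "target": [2, 3, 1],
    }
)

# One independent DataFrame per channel, keyed by channel name
toy_channel_dfs = {
    channel: group.copy() for channel, group in toy_df.groupby("ArticleChannel")
}

for channel, group in toy_channel_dfs.items():
    print(f"{channel}: {group.shape}")
```

Copying each group keeps later modifications from affecting the original DataFrame.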

### LIKE

In [17]:
# Group the DataFrame by the 'ArticleChannel' column
grouped_df_like = finalvotes_df_like.groupby('ArticleChannel')

# Create a dictionary to store DataFrames based on ArticleChannel
channel_dfs_like = {}

# Iterate through the grouped DataFrame and create individual DataFrames
for channel, channel_group in grouped_df_like:
    channel_dfs_like[channel] = channel_group.copy()

In [18]:
# Iterate through the dictionary and print the shape of each DataFrame
for channel, df in channel_dfs_like.items():
    print(f"Shape of DataFrame for {channel}: {df.shape}")

Shape of DataFrame for AutoMobil: (46330, 10)
Shape of DataFrame for Bildung: (27890, 10)
Shape of DataFrame for Diverses: (17733, 10)
Shape of DataFrame for Etat: (156791, 10)
Shape of DataFrame for Familie: (12420, 10)
Shape of DataFrame for Gesundheit: (32391, 10)
Shape of DataFrame for Immobilien: (14380, 10)
Shape of DataFrame for Inland: (841785, 10)
Shape of DataFrame for International: (199221, 10)
Shape of DataFrame for Karriere: (9983, 10)
Shape of DataFrame for Kultur: (94931, 10)
Shape of DataFrame for Lifestyle: (44028, 10)
Shape of DataFrame for Meinung: (397399, 10)
Shape of DataFrame for Panorama: (447382, 10)
Shape of DataFrame for Reisen: (10667, 10)
Shape of DataFrame for Sport: (130955, 10)
Shape of DataFrame for User: (52865, 10)
Shape of DataFrame for Web: (238988, 10)
Shape of DataFrame for Wirtschaft: (160545, 10)
Shape of DataFrame for Wissenschaft: (41044, 10)
Shape of DataFrame for Zukunft: (9463, 10)
Shape of DataFrame for dieStandard: (25564, 10)


### DISLIKE

In [19]:
# Group the DataFrame by the 'ArticleChannel' column
grouped_df_dislike = finalvotes_df_dislike.groupby('ArticleChannel')

# Create a dictionary to store DataFrames based on ArticleChannel
channel_dfs_dislike = {}

# Iterate through the grouped DataFrame and create individual DataFrames
for channel, channel_group in grouped_df_dislike:
    channel_dfs_dislike[channel] = channel_group.copy()

In [22]:
# Iterate through the dictionary and print the shape of each DataFrame
for channel, df in channel_dfs_dislike.items():
    print(f"Shape of DataFrame for {channel}: {df.shape}")

Shape of DataFrame for AutoMobil: (5366, 10)
Shape of DataFrame for Bildung: (8997, 10)
Shape of DataFrame for Diverses: (5758, 10)
Shape of DataFrame for Etat: (37614, 10)
Shape of DataFrame for Familie: (2690, 10)
Shape of DataFrame for Gesundheit: (10657, 10)
Shape of DataFrame for Immobilien: (3797, 10)
Shape of DataFrame for Inland: (218431, 10)
Shape of DataFrame for International: (74615, 10)
Shape of DataFrame for Karriere: (1489, 10)
Shape of DataFrame for Kultur: (27744, 10)
Shape of DataFrame for Lifestyle: (6408, 10)
Shape of DataFrame for Meinung: (127610, 10)
Shape of DataFrame for Panorama: (122928, 10)
Shape of DataFrame for Reisen: (2043, 10)
Shape of DataFrame for Sport: (35459, 10)
Shape of DataFrame for User: (3538, 10)
Shape of DataFrame for Web: (49653, 10)
Shape of DataFrame for Wirtschaft: (45756, 10)
Shape of DataFrame for Wissenschaft: (9807, 10)
Shape of DataFrame for Zukunft: (3587, 10)
Shape of DataFrame for dieStandard: (8282, 10)


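For both the like and the dislike networks, we focus on a subset of channels (German channel names in parentheses), namely:
- Sport (Sport)
- Economy (Wirtschaft)
- Culture (Kultur)
- Education (Bildung)
- Career (Karriere)
- Family (Familie)
- Inland (for politics)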

In [21]:
selected_channels = ['Wirtschaft', 'Sport', 'Kultur', 'Bildung', 'Karriere', 'Familie', 'Inland']

filtered_dict_like = {k: v for k, v in channel_dfs_like.items() if k in selected_channels}
filtered_dict_dislike = {k: v for k, v in channel_dfs_dislike.items() if k in selected_channels}

## Create the networks

### LIKES
We first focus on the like network. For each DataFrame in filtered_dict_like, we create a weighted graph (with the characteristics stated above). We store the graphs in a dictionary called graph_like.

Because of the amount of data, we use vectorized operations where possible to keep the code efficient.
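
One possible way to push the weight computation into pandas is to aggregate the vote pairs first and build the graph from the aggregated table. This is a sketch under assumptions, not the construction used in the cells of this notebook. `toy_df` and `edge_table` are hypothetical names, and `nx.DiGraph` is chosen here to keep the vote direction.

```python
import networkx as nx
import pandas as pd

# Hypothetical votes: user 1 voted twice on postings by user 2
toy_df = pd.DataFrame({"source": [1, 1, 2, 3], "target": [2, 2, 1, 2]})

# One row per ordered (source, target) pair, with the number of votes as weight
edge_table = (
    toy_df.groupby(["source", "target"]).size().reset_index(name="weight")
)

# Build the weighted graph directly from the aggregated edge table
toy_graph = nx.from_pandas_edgelist(
    edge_table,
    source="source",
    target="target",
    edge_attr="weight",
    create_using=nx.DiGraph,
)

print(toy_graph.edges(data=True))
```

With this approach the Python-level loop over edges is not needed, because the counting is done by `groupby` and the edge attributes are attached in one call.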

In [22]:
graph_like = {} 

# Iterate through the dictionary of DataFrames
for channel, df in filtered_dict_like.items():
        
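    # Create an empty graph
    # Note: nx.Graph() is undirected, so vote direction is not preserved;
    # nx.DiGraph() would be needed for the directed network described above
    G = nx.Graph()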

    # Extract columns into variables
    targets = df['target']
    sources = df['source']

    # Create edges using vectorized operations
    edges = list(zip(sources, targets))
    G.add_edges_from(edges)

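    # Set edge weights from the vote counts; accumulate because in an
    # undirected graph (u, v) and (v, u) refer to the same edge
    edge_weights = dict(Counter(edges))
    for (u, v), weight in edge_weights.items():
        G[u][v]['weight'] = G[u][v].get('weight', 0) + weight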

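    # Use vectorized operations for attributes
    # Note: in the previews above gender is coded as 'm'/'w'; if 'f' never
    # occurs in the data, female users are excluded by this filter
    valid_nodes = df_attributes['source'].isin(G.nodes())
    valid_rows = df_attributes[valid_nodes & (df_attributes['UserGender'].isin(['m', 'f'])) & (df_attributes['AccountAge'] > 0)]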

    # Set node attributes
    gender_dict = dict(zip(valid_rows['source'], valid_rows['UserGender']))
    age_dict = dict(zip(valid_rows['source'], valid_rows['AccountAge']))
    nx.set_node_attributes(G, gender_dict, 'gender')
    nx.set_node_attributes(G, age_dict, 'age')

    # Add the graph to the dictionary
    graph_like[channel] = G


### Results
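No assortative mixing by gender or age can be observed: all of these coefficients are very close to zero. Degree assortativity is slightly negative in the channels shown (roughly -0.06 to -0.15), which indicates weak disassortative mixing by degree.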
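
To help read these coefficients, the following sketch computes them on two toy graphs whose structure is known in advance. The node names, attribute values and graph names are hypothetical. `nx.numeric_assortativity_coefficient` is shown as an alternative for age that treats it as a number rather than a category; it is not the function used in the cells of this notebook.

```python
import networkx as nx

# Hypothetical attributes: two users per gender, with integer account ages
attributes = {
    "a": {"gender": "m", "age": 5},
    "b": {"gender": "m", "age": 5},
    "c": {"gender": "w", "age": 10},
    "d": {"gender": "w", "age": 10},
}

# Every edge links users with the same attributes (perfectly assortative)
same_graph = nx.Graph([("a", "b"), ("c", "d")])
nx.set_node_attributes(same_graph, attributes)

# Every edge links users with different attributes (perfectly disassortative)
mixed_graph = nx.Graph([("a", "c"), ("b", "d")])
nx.set_node_attributes(mixed_graph, attributes)

r_same = nx.attribute_assortativity_coefficient(same_graph, "gender")
r_mixed = nx.attribute_assortativity_coefficient(mixed_graph, "gender")

# Treats age as a numeric value instead of a categorical label
r_age = nx.numeric_assortativity_coefficient(same_graph, "age")

print(round(r_same, 4), round(r_mixed, 4), round(r_age, 4))
```

Values near +1 mean that edges mostly connect similar nodes, values near -1 mean that they mostly connect dissimilar nodes, and values near 0 mean no clear preference.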

In [23]:
# Iterate through the dictionary of graphs
for channel, G in graph_like.items():
    # Compute assortative mixing by gender, age, and degree
    assortativity_gender = nx.attribute_assortativity_coefficient(G, attribute='gender')
    assortativity_age = nx.attribute_assortativity_coefficient(G, attribute='age')
    degree_assortativity = nx.degree_assortativity_coefficient(G)

    # Print results with channel name
    print(f"Channel: {channel}")
    print(f"Assortative Mixing by Gender: {assortativity_gender:.4f}")
    print(f"Assortative Mixing by Age: {assortativity_age:.4f}")
    print(f"Assortive mixing by degree: {degree_assortativity:.4f}")
    print('------------------------------------------------------')

Channel: Bildung
Assortative Mixing by Gender: 0.0061
Assortative Mixing by Age: 0.0012
Assortive mixing by degree: -0.1465
------------------------------------------------------
Channel: Familie
Assortative Mixing by Gender: 0.0434
Assortative Mixing by Age: 0.0165
Assortive mixing by degree: -0.1033
------------------------------------------------------
Channel: Inland
Assortative Mixing by Gender: 0.0053
Assortative Mixing by Age: 0.0023
Assortive mixing by degree: -0.0578
------------------------------------------------------
Channel: Karriere
Assortative Mixing by Gender: 0.0205
Assortative Mixing by Age: 0.0063
Assortive mixing by degree: -0.1233
------------------------------------------------------
Channel: Kultur
Assortative Mixing by Gender: 0.0215
Assortative Mixing by Age: 0.0056
Assortive mixing by degree: -0.0693
------------------------------------------------------
Channel: Sport
Assortative Mixing by Gender: 0.0179
Assortative Mixing by Age: 0.0039
Assortive mixing by 

### DISLIKES
We perform the same operations to obtain the networks in which edges are negative votes.

In [24]:
graph_dislike = {} 

# Iterate through the dictionary of DataFrames
for channel, df in filtered_dict_dislike.items():
        
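    # Create an empty graph
    # Note: nx.Graph() is undirected, so vote direction is not preserved;
    # nx.DiGraph() would be needed for the directed network described above
    G = nx.Graph()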

    # Extract columns into variables
    targets = df['target']
    sources = df['source']

    # Create edges using vectorized operations
    edges = list(zip(sources, targets))
    G.add_edges_from(edges)

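    # Set edge weights from the vote counts; accumulate because in an
    # undirected graph (u, v) and (v, u) refer to the same edge
    edge_weights = dict(Counter(edges))
    for (u, v), weight in edge_weights.items():
        G[u][v]['weight'] = G[u][v].get('weight', 0) + weight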

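    # Use vectorized operations for attributes
    # Note: in the previews above gender is coded as 'm'/'w'; if 'f' never
    # occurs in the data, female users are excluded by this filter
    valid_nodes = df_attributes['source'].isin(G.nodes())
    valid_rows = df_attributes[valid_nodes & (df_attributes['UserGender'].isin(['m', 'f'])) & (df_attributes['AccountAge'] > 0)]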

    # Set node attributes
    gender_dict = dict(zip(valid_rows['source'], valid_rows['UserGender']))
    age_dict = dict(zip(valid_rows['source'], valid_rows['AccountAge']))
    nx.set_node_attributes(G, gender_dict, 'gender')
    nx.set_node_attributes(G, age_dict, 'age')

    # Add the graph to the dictionary
    graph_dislike[channel] = G


### Results
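In the dislike network as well, the gender and age coefficients are all very close to 0. Degree assortativity is again negative, and somewhat stronger than in the like network (roughly -0.12 to -0.27 in the channels shown).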

In [17]:
# Iterate through the dictionary of graphs
for channel, G in graph_dislike.items():
    # Compute assortative mixing by gender, age, and degree
    assortativity_gender = nx.attribute_assortativity_coefficient(G, attribute='gender')
    assortativity_age = nx.attribute_assortativity_coefficient(G, attribute='age')
    degree_assortativity = nx.degree_assortativity_coefficient(G)

    # Print results with channel name
    print(f"Channel: {channel}")
    print(f"Assortative Mixing by Gender: {assortativity_gender:.4f}")
    print(f"Assortative Mixing by Age: {assortativity_age:.4f}")
    print(f"Assortive mixing by degree: {degree_assortativity:.4f}")
    print('------------------------------------------------------')

Channel: Bildung
Assortative Mixing by Gender: 0.0208
Assortative Mixing by Age: 0.0065
Assortive mixing by degree: -0.1905
------------------------------------------------------
Channel: Familie
Assortative Mixing by Gender: 0.0315
Assortative Mixing by Age: 0.0092
Assortive mixing by degree: -0.2691
------------------------------------------------------
Channel: Inland
Assortative Mixing by Gender: -0.0042
Assortative Mixing by Age: -0.0014
Assortive mixing by degree: -0.1188
------------------------------------------------------
Channel: Karriere
Assortative Mixing by Gender: 0.0460
Assortative Mixing by Age: 0.0100
Assortive mixing by degree: -0.1773
------------------------------------------------------
Channel: Kultur
Assortative Mixing by Gender: -0.0015
Assortative Mixing by Age: -0.0027
Assortive mixing by degree: -0.1550
------------------------------------------------------
Channel: Sport
Assortative Mixing by Gender: -0.0129
Assortative Mixing by Age: -0.0033
Assortive mixi