## Import needed libraries

In [5]:
import numpy as np
import networkx as nx
import pandas as pd
import matplotlib.pyplot as plt
from pathlib import Path

# POSTINGS NETWORK
This notebook contains the code used to obtain assortative mixing coefficients for the network built from replies to postings. In particular:
- users are the nodes
- an edge is created when one user (source) replies to the posting of another user (target)

We build a directed, weighted network in which each edge weight is the number of times the source replied to the target.
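
As a minimal sketch of this definition, the snippet below counts ordered (source, target) reply pairs to obtain edge weights. The user names and the `replies` list are hypothetical toy data, not taken from the dataset.

```python
from collections import Counter

# Hypothetical toy data: one (replying user, replied-to user) pair per reply
replies = [("alice", "bob"), ("alice", "bob"), ("bob", "alice"), ("carol", "bob")]

# The number of occurrences of each ordered pair is the weight of that directed edge
edge_weights = Counter(replies)

weighted_edges = sorted((src, tgt, w) for (src, tgt), w in edge_weights.items())
for src, tgt, w in weighted_edges:
    print(f"{src} -> {tgt} (weight {w})")
```

Because the pairs are ordered, alice -> bob and bob -> alice are distinct edges with their own weights.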

## Setup & input

In [3]:
DATA_DIR = Path("../../data/raw/")
filename_first = "Postings_01052019_15052019.csv"
filename_second = "Postings_16052019_31052019.csv"

# use this output-path for saving figures
FIG_OUTPUT_PATH = Path("../../reports/figures/postings/")

## Read the data

In [4]:
postings1_df = pd.read_csv(DATA_DIR / filename_first, sep=';', dtype=str)
postings2_df = pd.read_csv(DATA_DIR / filename_second, sep=';', dtype=str)

FileNotFoundError: [Errno 2] No such file or directory: '..\\..\\data\\raw\\Postings_01052019_15052019.csv'

In [7]:
# Read data about all postings in the period 01.05.2019 - 31.05.2019
postings1_df = pd.read_csv('Postings_01052019_15052019.csv', sep=';')
postings2_df = pd.read_csv('Postings_16052019_31052019.csv', sep=';')

In [8]:
# Create a unique dataframe
postings_df = pd.concat([postings1_df, postings2_df])

# Convert the account creation date to the "age" of the account in years (the reference year 2024 is hard-coded)
postings_df['UserCreatedAt'] = 2024 - (pd.to_datetime(postings_df['UserCreatedAt']).dt.year)
postings_df = postings_df.rename(columns={'UserCreatedAt': 'AccountAge'})

postings_df.head()

Unnamed: 0,ID_Posting,ID_Posting_Parent,ID_CommunityIdentity,PostingHeadline,PostingComment,PostingCreatedAt,ID_Article,ArticlePublishingDate,ArticleTitle,ArticleChannel,ArticleRessortName,UserCommunityName,UserGender,AccountAge
0,1041073586,1041073000.0,671476,Das hat gestern bereits der Voggenhuber angefü...,schieder hatte dem inhaltlich nichts entgegenz...,2019-05-01 18:21:15.127,2000102330973,2019-05-01 10:28:57.49,1. Mai in Wien: SPÖ fordert von Strache Rücktritt,Inland,Parteien,Ravenspower,,6
1,1041073839,1041073000.0,566938,,...und meinen Bezirk bekommst du als Erbe mit.,2019-05-01 18:28:22.040,2000102330973,2019-05-01 10:28:57.49,1. Mai in Wien: SPÖ fordert von Strache Rücktritt,Inland,Parteien,AlphaRomeo,m,9
2,1041073872,1041069000.0,669286,,"Nein, bei der ÖVP/FPÖ genauso passiert. Ich wo...",2019-05-01 18:29:05.533,2000102330973,2019-05-01 10:28:57.49,1. Mai in Wien: SPÖ fordert von Strache Rücktritt,Inland,Parteien,Hpolditsch,,6
3,1041080734,1041080000.0,671476,Sie haben doch nichts gefordert??,sie haben nur die regierung kritisiert. das di...,2019-05-01 22:37:56.010,2000102330973,2019-05-01 10:28:57.49,1. Mai in Wien: SPÖ fordert von Strache Rücktritt,Inland,Parteien,Ravenspower,,6
4,1041080828,,671476,Heute wäre der perfekte Tag für die SPÖ gewese...,"ihr noch nicht erfülltes versprechen, den silb...",2019-05-01 22:42:06.310,2000102330973,2019-05-01 10:28:57.49,1. Mai in Wien: SPÖ fordert von Strache Rücktritt,Inland,Parteien,Ravenspower,,6


In [9]:
# Let us see how many unique users we have in the dataset - they will be the nodes of the networks
print(len(postings_df))
print(len(postings_df['ID_CommunityIdentity'].unique()))

739094
23925


## Node attributes
Now we create a separate dataframe in which we store the attributes of each user in the blog; they will be the attributes of each node in the networks.

For this analysis we are interested in:
- gender of the user
- age of the account

In [10]:
# Extract nodes attributes and store them in a pandas df - assuming that one node is a user
df_attributes = postings_df[['ID_CommunityIdentity', 'UserGender', 'AccountAge']].drop_duplicates()
len(df_attributes)
df_attributes.head()

Unnamed: 0,ID_CommunityIdentity,UserGender,AccountAge
0,671476,,6
1,566938,m,9
2,669286,,6
6,523527,m,11
7,74674,,17


## Filter data according to the Article Channel
Because of the large amount of data, we measure the assortative mixing coefficients on subgroups. The data is split by the article channel of each posting, and a dictionary of DataFrames stores the data for each channel.

In [11]:
# Get the list of unique channels
channels = postings_df['ArticleChannel'].unique()

print(channels)
print(len(channels))

['Inland' 'Meinung' 'International' 'Sport' 'Panorama' 'Wirtschaft'
 'Gesundheit' 'Web' 'Kultur' 'Bildung' 'Wissenschaft' 'Immobilien' 'User'
 'dieStandard' 'Etat' 'Zukunft' 'Karriere' 'Lifestyle' 'Diverses' 'Reisen'
 'AutoMobil' 'Familie']
22


In [12]:
# Get the list of unique sub_channels for Inland
sub_channels = postings_df[postings_df['ArticleChannel'] == 'Inland']['ArticleRessortName'].unique()

print(sub_channels)
print(len(sub_channels))

['Parteien' 'Sachpolitik' 'SPÖ' 'Sozialpolitik & Armut' 'Inland'
 'Heer & Zivildienst' 'Niederösterreich' 'Neos' 'Rechtsextremismus' 'ÖVP'
 'Nationalsozialismus' 'Koalition' 'Gesundheitspolitik'
 'Religion & Politik' 'Österreich und EU' 'Parlament' 'Tirol' 'FPÖ'
 'Integrationspolitik' 'Oberösterreich' 'Salzburg' 'Phänomen Hass'
 'Pensionen' 'Öffentlicher Dienst' 'Staat & Justiz' 'Bundespräsident'
 'Standardabweichung' 'Grüne' 'Verfassungsschutz' 'Nationalrat'
 'Liste Jetzt' 'Kärnten' 'Bundesländer' 'Sicherheitspolitik' 'Eurofighter'
 'Burgenland' 'Wiener Politik' 'Vorarlberg' 'Kulturpolitik'
 'Lobbying & Korruption' 'Politische Umfragen'
 'Blog: Stadt, Land, Politik'
 'EU-Wahl 2019 in Österreich: Wahlergebnisse und -gr'
 'Nationalratswahl 2017']
44


Article channels can be further divided into more specific sub-channels. Since most topics in the "Inland" channel deal with domestic politics, we use the "Inland" DataFrame to study user homophily in the political domain.

In [13]:
# Group the DataFrame by the 'ArticleChannel' column
grouped_df = postings_df.groupby('ArticleChannel')

# Create a dictionary to store DataFrames based on ArticleChannel
channel_dfs = {}

# Iterate through the grouped DataFrame and create individual DataFrames
for channel, channel_group in grouped_df:
    channel_dfs[channel] = channel_group.copy()

In [14]:
# Print the shape of each DataFrame to see how many postings each channel contains
for channel, df in channel_dfs.items():
    print(f"Shape of DataFrame for {channel}: {df.shape}")

Shape of DataFrame for AutoMobil: (6366, 14)
Shape of DataFrame for Bildung: (7942, 14)
Shape of DataFrame for Diverses: (4597, 14)
Shape of DataFrame for Etat: (34890, 14)
Shape of DataFrame for Familie: (2658, 14)
Shape of DataFrame for Gesundheit: (10490, 14)
Shape of DataFrame for Immobilien: (5071, 14)
Shape of DataFrame for Inland: (188069, 14)
Shape of DataFrame for International: (55911, 14)
Shape of DataFrame for Karriere: (2459, 14)
Shape of DataFrame for Kultur: (24013, 14)
Shape of DataFrame for Lifestyle: (11124, 14)
Shape of DataFrame for Meinung: (96008, 14)
Shape of DataFrame for Panorama: (98391, 14)
Shape of DataFrame for Reisen: (3285, 14)
Shape of DataFrame for Sport: (40396, 14)
Shape of DataFrame for User: (12760, 14)
Shape of DataFrame for Web: (59589, 14)
Shape of DataFrame for Wirtschaft: (50755, 14)
Shape of DataFrame for Wissenschaft: (12565, 14)
Shape of DataFrame for Zukunft: (4228, 14)
Shape of DataFrame for dieStandard: (7527, 14)


We focus on a subset of channels, namely
- Sport
- Economy (Wirtschaft)
- Culture (Kultur)
- Education (Bildung)
- Career (Karriere)
- Family (Familie)
- Inland (for politics)

In [15]:
# Consider only a subset of DataFrame in the dictionary
selected_channels = ['Wirtschaft', 'Sport', 'Kultur', 'Bildung', 'Karriere', 'Familie', 'Inland']
filtered_dict = {k: v for k, v in channel_dfs.items() if k in selected_channels}

## Build the networks
For each DataFrame in filtered_dict, we create a weighted network (with the characteristics stated above). We store the networks in a dictionary.

In [16]:
###### Disclaimer: this cell is rather slow

# Create dictionaries to store edge lists and weighted edge lists
edge_lists = {}
weighted_edge_lists = {}

# Create a dictionary to store DataFrames containing source (replying user), target (replied-to user) and weight (number of replies)
edges_dfs = {} 

### Obtain edge list and df for each DataFrames in filtered_dict
for channel, df in filtered_dict.items():
    # Obtain the list of edges for the current DataFrame
    edgeList = [
        [post.ID_CommunityIdentity, next(iter(df[df.ID_Posting == post.ID_Posting_Parent].ID_CommunityIdentity))]
        for idx, post in df.iterrows()
        if not pd.isna(post.ID_Posting_Parent)
    ]

    # Obtain the weights for the edges
    weightedEdgeList = [(edge[0], edge[1], edgeList.count(edge)) for edge in edgeList]
    weightedEdgeList = list(set(weightedEdgeList))

    # Store the lists for each DataFrame in the dictionaries
    edge_lists[channel] = edgeList
    weighted_edge_lists[channel] = weightedEdgeList

    # Create a DataFrame for the current list of edges
    edges_df = pd.DataFrame(weightedEdgeList, columns=['source', 'target', 'weight'])         # needed for creating the network
    edges_dfs[channel] = edges_df
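
The cell above scans the whole DataFrame once per posting and calls `list.count` once per edge, which is why it is slow. A possible faster variant, sketched here on hypothetical toy data, joins each reply to its parent posting with a merge and counts pairs with a groupby. The helper name `build_weighted_edges` is an assumption of this sketch; only the column names come from the dataset. Unlike the loop above, the inner join silently drops replies whose parent posting is not in the DataFrame.

```python
import pandas as pd

# Hypothetical toy postings; column names follow the notebook's dataset
toy_df = pd.DataFrame({
    "ID_Posting": [1, 2, 3, 4, 5],
    "ID_Posting_Parent": [None, 1, 1, 2, 1],
    "ID_CommunityIdentity": [10, 20, 20, 10, 30],
})

def build_weighted_edges(df):
    """Return a DataFrame with columns source, target, weight (hypothetical helper)."""
    # Keep only replies, and align the key dtype with ID_Posting
    replies = df.dropna(subset=["ID_Posting_Parent"])
    replies = replies.assign(ID_Posting_Parent=replies["ID_Posting_Parent"].astype("int64"))

    # Lookup table: posting id -> author of that posting (the reply target)
    parents = df[["ID_Posting", "ID_CommunityIdentity"]].rename(
        columns={"ID_Posting": "ID_Posting_Parent", "ID_CommunityIdentity": "target"}
    )

    # Inner join: replies whose parent posting is absent from df are dropped
    merged = replies.merge(parents, on="ID_Posting_Parent", how="inner")
    merged = merged.rename(columns={"ID_CommunityIdentity": "source"})

    # Number of replies per ordered (source, target) pair is the edge weight
    return merged.groupby(["source", "target"]).size().reset_index(name="weight")

toy_edges_df = build_weighted_edges(toy_df)
print(toy_edges_df)
```

The resulting frame has the same three columns as edges_df, so it could be passed to nx.from_pandas_edgelist in the same way.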

In [17]:
# Create and store a graph for each DataFrame in edges_df
graphs = {}
for channel_name, df in edges_dfs.items():
    G = nx.from_pandas_edgelist(df, 'source', 'target', edge_attr='weight', create_using=nx.DiGraph())

    # account for missing values, using the attributes df created before
    for _, row in df_attributes.iterrows():
        if row['ID_CommunityIdentity'] in G:
            if row['UserGender'] in ['m', 'f'] and row['AccountAge'] > 0:        # no missing values should occur in the AccountAge variable
                nx.set_node_attributes(G, {row['ID_CommunityIdentity']: row['UserGender']}, 'gender')
                nx.set_node_attributes(G, {row['ID_CommunityIdentity']: row['AccountAge']}, 'age')
    
    graphs[channel_name] = G
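
The per-row loop above could also be written as two bulk calls, since `nx.set_node_attributes` accepts a dictionary keyed by node and ignores keys that are not nodes of the graph. The following is a sketch on hypothetical toy frames (`toy_edges`, `toy_attributes`) that mirror edges_df and df_attributes; it is not the notebook's own code.

```python
import networkx as nx
import pandas as pd

# Hypothetical toy inputs mirroring edges_df and df_attributes
toy_edges = pd.DataFrame({"source": [1, 2, 3], "target": [2, 1, 1], "weight": [2, 1, 1]})
toy_attributes = pd.DataFrame({
    "ID_CommunityIdentity": [1, 2, 3, 4],
    "UserGender": ["m", "f", None, "m"],
    "AccountAge": [6, 9, 11, 17],
})

G = nx.from_pandas_edgelist(
    toy_edges, "source", "target", edge_attr="weight", create_using=nx.DiGraph()
)

# Keep only users with a known gender and a positive account age
mask = toy_attributes["UserGender"].isin(["m", "f"]) & (toy_attributes["AccountAge"] > 0)
valid = toy_attributes[mask].set_index("ID_CommunityIdentity")

# Dict-valued calls: keys that are not nodes of G (user 4 here) are ignored
nx.set_node_attributes(G, valid["UserGender"].to_dict(), "gender")
nx.set_node_attributes(G, valid["AccountAge"].to_dict(), "age")

print(dict(G.nodes(data=True)))
```

User 3 has no valid gender, so that node stays in the graph without attributes, as in the loop above.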

## Results
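According to the results, no clear assortative mixing is observed for any of the considered features. The gender and age coefficients are very close to zero in all article channels, and degree assortativity is weak at most (the largest value shown below is about 0.11, in the Familie channel).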
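
To give a sense of the scale of the coefficient, the sketch below evaluates `nx.attribute_assortativity_coefficient` on two hypothetical four-node graphs: one where every reply stays within the same gender and one where every reply crosses genders. The node names and the helper `gender_assortativity` are assumptions made for illustration.

```python
import networkx as nx

def gender_assortativity(edges, genders):
    """Build a small directed graph and return its gender assortativity (hypothetical helper)."""
    G = nx.DiGraph()
    G.add_edges_from(edges)
    nx.set_node_attributes(G, genders, "gender")
    return nx.attribute_assortativity_coefficient(G, "gender")

# Hypothetical toy users
genders = {"a": "m", "b": "m", "c": "f", "d": "f"}

# Every edge links two users of the same gender
same = gender_assortativity([("a", "b"), ("c", "d")], genders)

# Every edge links users of different genders
cross = gender_assortativity([("a", "c"), ("d", "b")], genders)

print(f"same-gender edges only:  {same:.2f}")
print(f"cross-gender edges only: {cross:.2f}")
```

The first graph is perfectly assortative (coefficient 1) and the second perfectly disassortative (coefficient -1), so values near zero indicate that replies are close to independent of the attribute.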

In [18]:
# Iterate through the dictionary of graphs
for channel, G in graphs.items():
    # Compute assortative mixing by gender, by age and by node (in-)degree
    assortativity_gender = nx.attribute_assortativity_coefficient(G, attribute='gender')
    assortativity_age = nx.attribute_assortativity_coefficient(G, attribute='age')
    degree_assortativity = nx.degree_assortativity_coefficient(G, x='in', y='in', weight='weight')

    # Print results with channel name
    print(f"Channel: {channel}")
    print(f"Assortative Mixing by Gender: {assortativity_gender:.4f}")
    print(f"Assortative Mixing by Age: {assortativity_age:.4f}")
    print(f"Assortative mixing by degree: {degree_assortativity:.4f}")
    print('------------------------------------------------------')

Channel: Bildung
Assortative Mixing by Gender: 0.0387
Assortative Mixing by Age: 0.0306
Assortative mixing by degree: 0.0319
------------------------------------------------------
Channel: Familie
Assortative Mixing by Gender: -0.0164
Assortative Mixing by Age: 0.0032
Assortative mixing by degree: 0.1127
------------------------------------------------------
Channel: Inland
Assortative Mixing by Gender: 0.0137
Assortative Mixing by Age: 0.0120
Assortative mixing by degree: -0.0114
------------------------------------------------------
Channel: Karriere
Assortative Mixing by Gender: 0.0365
Assortative Mixing by Age: 0.0193
Assortative mixing by degree: 0.0990
------------------------------------------------------
Channel: Kultur
Assortative Mixing by Gender: 0.0394
Assortative Mixing by Age: 0.0220
Assortative mixing by degree: -0.0099
------------------------------------------------------
Channel: Sport
Assortative Mixing by Gender: 0.0228
Assortative Mixing by Age: 0.0144
Assortative mixing by de