## 02 - Graph Creation and Visualization

In [None]:
import math
import os

import matplotlib.pyplot as plt
import networkx as nx
import pandas as pd

### Loading Cleaned Dataframe and Gathering data for Graph Creation

In [None]:
# Defining working directory and data path
wd = os.getcwd()
data_path = wd + "/data/"

In [None]:
# Loading dataset
df = pd.read_csv(data_path + "reddit_dataset_compiled_clean.csv", names=[
    "link","comment_id","user_id","parent","timestamp","title","text","Original File Source","Original User Source"
], engine='python', on_bad_lines='skip')
# Drop first row with column names
df = df.drop(0)

The following code is a bit tricky, but in essence it extracts users (nodes) and user interactions (edges). In more detail:

Root nodes are the users who posted a thread. A thread post is a row without a parent ID (a parent ID refers to the comment ID that the post replied to). Users can reply to thread posts, to replies to those posts, to replies to those replies, and so on, so the data form a deep hierarchical structure.

To extract user interactions, we first register the root posts and then pass over the remaining rows repeatedly. In each pass we check whether a row's parent ID is already registered; if it is, an interaction (edge) is added from the author of the post to the author of the parent post, unless both are the same user. The author of the post is also added to the set of nodes if not already present. In addition, each post's text is stored under its author's user ID, grouped by the subreddit the post originated from. The loop stops when a pass processes no rows, or after 50 passes. For further detail on how the graph is populated, see the code cell that follows.
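
The multi-pass idea described above can be illustrated on a handful of toy rows. This is only a minimal sketch, not the notebook's actual implementation: the names `toy_rows`, `toy_chats`, `toy_edges` and `pending` are hypothetical, and each row is reduced to (comment ID, user ID, parent ID).

```python
# Toy rows: (comment_id, user_id, parent_id); parent_id is None for root posts.
# All names and values here are made up for illustration.
toy_rows = [
    ("c3", "carol", "c2"),      # reply to a reply, listed before its parent
    ("c1", "alice", None),      # root post
    ("c2", "bob", "c1"),        # reply to the root post
    ("c4", "alice", "c1"),      # self-reply: must not create an edge
    ("c5", "dave", "missing"),  # orphan: its parent never appears
]

toy_chats = {}  # comment_id -> author of that comment
toy_edges = []  # (author of reply, author of parent)

# First pass: register root posts, queue everything else.
pending = []
for c_id, u_id, p_id in toy_rows:
    if p_id is None:
        toy_chats[c_id] = u_id
    else:
        pending.append((c_id, u_id, p_id))

# Later passes: attach every comment whose parent is already registered.
while pending:
    still_pending = []
    for c_id, u_id, p_id in pending:
        if p_id in toy_chats:
            toy_chats[c_id] = u_id
            if toy_chats[p_id] != u_id:  # avoid self-loops
                toy_edges.append((u_id, toy_chats[p_id]))
        else:
            still_pending.append((c_id, u_id, p_id))
    if len(still_pending) == len(pending):
        break  # no progress in this pass: the remaining rows are orphans
    pending = still_pending

print(toy_edges)
print(pending)
```

In this toy run `carol`'s comment can only be attached in the second pass, once its parent `c2` has been registered, which is why more than one pass is needed. Rows whose parent never appears stay in `pending`.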

In [None]:
incels_reduced = df  # Define this first
incels_reduced_small = incels_reduced.copy()  # Work with a copy
incels_reduced_small.fillna(False, inplace=True)

chats = {}
edges = []
nodes = set()
node_attributes = {}

# First pass: add all root posts (rows without a parent ID)
for idx, row in incels_reduced_small.iterrows():
    link, c_id, u_id, p_id, timestamp, title, text, c_origin, u_origin = row
    if not text:
        text = ""
    if not p_id:  # This is a root post
        chats[c_id] = {
            'text': text,
            'user_id': u_id
        }
        # Populate node attributes
        if node_attributes.get(u_id):
            if node_attributes[u_id]['posts'].get(c_origin.replace('r/', '')):
                node_attributes[u_id]['posts'][c_origin.replace('r/', '')].append(text)
            else:
                node_attributes[u_id]['posts'][c_origin.replace('r/', '')] = [text]
        else:
            if not u_origin:
                u_origin = "unknown"
            node_attributes[u_id] = {
                'posts': {
                    c_origin.replace('r/', ''): [text]
                    },
                'subreddit_origin' : u_origin
            }
            nodes.add(u_id)

print(f"Root posts found: {len(chats)}")

# Multiple passes to handle parent-child relationships
max_iterations = 50
iteration = 0


while not incels_reduced_small.empty and iteration < max_iterations:
    indices_to_drop = []
    incels_reduced_small = incels_reduced_small.iloc[::-1]
    for idx, row in incels_reduced_small.iterrows():
        link, c_id, u_id, p_id, timestamp, title, text, c_origin, u_origin = row
        
        # Skip if already processed (root posts)
        if c_id in chats:
            indices_to_drop.append(idx)
            continue
            
        # Process if parent exists in chats
        if p_id in chats:
            # Normalise missing text before storing it, as in the first pass
            if not text:
                text = ""
            chats[c_id] = {
                'text': text,
                'user_id': u_id,
            }
            if chats[p_id]['user_id'] != u_id:  # Avoid self-loops
                edges.append((u_id, chats[p_id]['user_id']))

            # Populate node attributes
            if node_attributes.get(u_id):
                if node_attributes[u_id]['posts'].get(c_origin.replace('r/', '')):
                    node_attributes[u_id]['posts'][c_origin.replace('r/', '')].append(text)
                else:
                    node_attributes[u_id]['posts'][c_origin.replace('r/', '')] = [text]
            else:
                if not u_origin:
                    u_origin = "unknown"
                node_attributes[u_id] = {
                    'posts': {
                        c_origin.replace('r/', ''): [text]
                        },
                    'subreddit_origin' : u_origin
                }
                nodes.add(u_id)
            indices_to_drop.append(idx)
    
    print(f"Iteration {iteration + 1}: Processing {len(indices_to_drop)} comments")
    
    
    if not indices_to_drop:
        print("No more comments can be processed - breaking")
        break   
    
    incels_reduced_small = incels_reduced_small.drop(index=indices_to_drop)
    print(f"Remaining rows: {len(incels_reduced_small)}")
    iteration += 1

print(f"\nFinal results:")
print(f"Total comments processed: {len(chats)}")
print(f"Total edges created: {len(edges)}")


Root posts found: 421638
Iteration 1: Processing 1720366 comments
Remaining rows: 1377874
Iteration 2: Processing 819119 comments
Remaining rows: 558755
Iteration 3: Processing 131433 comments
Remaining rows: 427322
Iteration 4: Processing 150186 comments
Remaining rows: 277136
Iteration 5: Processing 28024 comments
Remaining rows: 249112
Iteration 6: Processing 40553 comments
Remaining rows: 208559
Iteration 7: Processing 7237 comments
Remaining rows: 201322
Iteration 8: Processing 12256 comments
Remaining rows: 189066
Iteration 9: Processing 2203 comments
Remaining rows: 186863
Iteration 10: Processing 4339 comments
Remaining rows: 182524
Iteration 11: Processing 793 comments
Remaining rows: 181731
Iteration 12: Processing 1637 comments
Remaining rows: 180094
Iteration 13: Processing 301 comments
Remaining rows: 179793
Iteration 14: Processing 731 comments
Remaining rows: 179062
Iteration 15: Processing 127 comments
Remaining rows: 178935
Iteration 16: Processing 354 comments
Remaini

### Cleaning Edges and Adding attributes

The following code identifies the unique relationships between users and assigns each one a weight equal to the number of times the two users have interacted, i.e. commented on each other's posts.

In [None]:
distinct_edges = set()
for u, v in edges:
    if (u, v) in distinct_edges or (v, u) in distinct_edges:
        continue
    distinct_edges.add((u, v))


In [None]:
# Total number of relationships between two different users
len(distinct_edges)

1011470

In [None]:
# Here we handle either direction of the edge for weighting
weighted_edges = {}
for u, v in edges:
    if (u, v) in distinct_edges:
        if (u, v) in weighted_edges:
            weighted_edges[(u, v)] += 1
        else:
            weighted_edges[(u, v)] = 1
    elif (v, u) in distinct_edges:
        if (v, u) in weighted_edges:
            weighted_edges[(v, u)] += 1
        else:
            weighted_edges[(v, u)] = 1

In [None]:
# Format list for NetworkX
weighted_edges_list = [(u, v, w) for (u, v), w in weighted_edges.items()]
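
The deduplication and weighting steps above could also be written more compactly. The sketch below is an alternative under stated assumptions, not the code used for the results in this notebook: it assumes user IDs can be compared with each other so that each pair can be sorted, and the names `toy_edges`, `toy_weights` and `toy_weighted_list` are hypothetical.

```python
from collections import Counter

# Toy directed interactions (replier, parent author); values are made up.
toy_edges = [
    ("bob", "alice"),
    ("alice", "bob"),
    ("carol", "bob"),
    ("bob", "alice"),
]

# Sort each pair so that (u, v) and (v, u) map to the same key,
# then count how often each unordered pair occurs.
toy_weights = Counter(tuple(sorted(pair)) for pair in toy_edges)

# Same (u, v, weight) layout that add_weighted_edges_from expects.
toy_weighted_list = [(u, v, w) for (u, v), w in toy_weights.items()]
print(toy_weighted_list)
```

Unlike the cells above, this keeps each pair in sorted order rather than in first-seen order. For an undirected `nx.Graph` the orientation of a pair does not matter.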

Appending the total number of posts for each user.

In [None]:
for node, value in node_attributes.items():
    total_posts = 0
    for sub, posts in value['posts'].items():
        total_posts += len(posts)
    node_attributes[node]['n_posts'] = total_posts

### Populating Graph with Data

In [None]:
# Creating and populating the graph
G = nx.Graph()
G.add_nodes_from(nodes)
G.add_weighted_edges_from(weighted_edges_list)
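
As a small illustration of what the two calls above do, the following sketch builds a toy graph. The names `toy_graph`, `toy_nodes` and `toy_weighted_list` and all values are hypothetical. It shows that nodes without any edge are kept when they are added explicitly, and that the third element of each tuple ends up in the `weight` edge attribute.

```python
import networkx as nx

toy_nodes = {"alice", "bob", "carol"}    # carol never interacts with anyone
toy_weighted_list = [("alice", "bob", 2)]

toy_graph = nx.Graph()
toy_graph.add_nodes_from(toy_nodes)                    # keeps isolated users
toy_graph.add_weighted_edges_from(toy_weighted_list)   # stores 'weight'

print(toy_graph.number_of_nodes())
print(toy_graph["alice"]["bob"]["weight"])
print(toy_graph.degree("carol"))
# Weighted degree: sum of the interaction counts on a node's edges
print(toy_graph.degree("alice", weight="weight"))
```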

The code here was also written with AI assistance; it is used to check that the node attributes are correctly populated.

In [None]:
from itertools import islice

# display attributes for 5 nodes (prefer attributes stored on G, fallback to node_attributes)

def show_node_attrs(n=5):
    g_nodes = list(G.nodes(data=True))
    # prefer nodes with attributes in G
    if any(attrs for _, attrs in g_nodes):
        for node, attrs in islice(g_nodes, n):
            print(f"Node: {node}\nAttributes: {attrs}\n")
        return
    # fallback to node_attributes dict
    for node, attrs in islice(node_attributes.items(), n):
        print(f"Node: {node}\nAttributes: {attrs}\n")

show_node_attrs(5)

Node: 924979745.996053

Node: 924982552.262455
Attributes: {'posts': {'r/Incel': [False]}, 'subreddit_origin': 'r/Incel', 'n_posts': 1}

Node: 924982222.493196
Attributes: {'posts': {'r/Incel': [False, False, False]}, 'subreddit_origin': 'r/Incel', 'n_posts': 3}

Node: 925005710.115268
Attributes: {'posts': {'r/Incel': [False, "The fact that we don't see women as objects.", False, 'Check my history, if I fit, add me!', "A less hypocrite one. Where ugly men are confronted with the fact that they are inferior to the more aesthetic boys, because of female nature, a determining force, that will work against them throughout their whole life, at a very early age and they aren't fed with lies, they aren't praised so much by their parents (parents will be less biased). They are treated as subhumen from the start so that they won't have illusions about the future and they will be aware they will have to work much harder and their reward will be nothing compared to what their more attractive pee

In [None]:
# Populate nodes with attributes
nx.set_node_attributes(G, node_attributes)
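
A quick way to confirm that the attributes ended up on the graph is to read them back from the node view. The sketch below uses a toy graph with hypothetical names (`toy_graph`, `toy_attributes`) and made-up values; the attribute layout mirrors `node_attributes` only loosely.

```python
import networkx as nx

toy_graph = nx.Graph()
toy_graph.add_nodes_from(["alice", "bob"])

# Dict of dicts: node -> {attribute name -> value}
toy_attributes = {
    "alice": {"subreddit_origin": "unknown", "n_posts": 2},
    "bob": {"subreddit_origin": "unknown", "n_posts": 1},
}
nx.set_node_attributes(toy_graph, toy_attributes)

# Read the attributes back from the graph itself
for node, attrs in toy_graph.nodes(data=True):
    print(node, attrs)
```

After this call the attributes live on the graph, so a check such as `show_node_attrs` would read them from `G` instead of falling back to the dictionary.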

In [None]:
# Export graph to GML
nx.write_gml(G, data_path + "reddit_graph.gml")
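
To check that an exported file can be loaded again, a round trip through `nx.read_gml` can be tried on a small graph first. This is a minimal sketch under assumptions: the toy node IDs are strings, the attributes are plain integers, and the file name `toy_graph.gml` is hypothetical. It does not cover the nested `posts` attribute or other ID types.

```python
import os
import tempfile

import networkx as nx

toy_graph = nx.Graph()
toy_graph.add_weighted_edges_from([("alice", "bob", 3)])
nx.set_node_attributes(toy_graph, {"alice": {"n_posts": 2}, "bob": {"n_posts": 1}})

# Write to a temporary directory so the real data folder is left untouched
with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy_graph.gml")
    nx.write_gml(toy_graph, toy_path)
    reloaded = nx.read_gml(toy_path)

print(reloaded.nodes["alice"]["n_posts"])
print(reloaded["alice"]["bob"]["weight"])
```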