## Structural Similarity

This notebook builds on the previous step, in which we computed aggressive-language categories for every comment in the Reddit dataset. Here we measure the structural similarity between users who interacted through comments. To do so, we split the dataset into weekly snapshots, build a weighted, directed graph of user interactions for each week, and train node2vec embeddings for every user in that graph. NetworkX is installed below, but the graphs in this notebook are built with the `Graph` class from the fastnode2vec library. The structural similarity of an author and a receiver is then the cosine similarity of their embeddings.
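
As a rough illustration of the weekly snapshots, the sketch below assigns a 1-based week index to each comment and splits the data by week. The helper `add_week_index`, the toy dataframe, and the start date are hypothetical. The notebook itself reads pre-split weekly parquet files, so this is only one way such a `week` column could be derived.

```python
import pandas as pd

def add_week_index(df, date_col="date", start=None):
    """Return a copy of df with a 1-based 'week' column counted from `start`.

    Hypothetical helper: the week numbering used in the real dataset may differ.
    """
    dates = pd.to_datetime(df[date_col])
    # Default to midnight of the earliest date when no start date is given
    start = dates.min().normalize() if start is None else pd.Timestamp(start)
    out = df.copy()
    # Days 0-6 fall in week 1, days 7-13 in week 2, and so on
    out["week"] = (dates - start).dt.days // 7 + 1
    return out

# Toy comment data (made up for illustration)
comments = pd.DataFrame({
    "author": ["a", "b", "c", "a"],
    "receiver": ["b", "a", "a", "c"],
    "date": ["2022-05-01", "2022-05-03", "2022-05-08", "2022-05-16"],
})

weekly = add_week_index(comments, start="2022-05-01")
# One dataframe per week, keyed by the week index
snapshots = {week: frame for week, frame in weekly.groupby("week")}
```

Each entry of `snapshots` plays the role of the `week_data` argument that the functions below expect.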

---

Install Libraries

In [2]:
pip install pyarrow

Collecting pyarrow
  Downloading pyarrow-16.0.0-cp38-cp38-manylinux_2_28_x86_64.whl (40.8 MB)

In [3]:
pip install networkx==1.11 

Collecting networkx==1.11
  Downloading networkx-1.11-py2.py3-none-any.whl (1.3 MB)
Installing collected packages: networkx
  Attempting uninstall: networkx
    Found existing installation: networkx 3.1
    Uninstalling networkx-3.1:
      Successfully uninstalled networkx-3.1
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
scikit-image 0.20.0 requires networkx>=2.8, but you have networkx 1.11 which is incompatible.
Successfully installed networkx-1.11

[notice] A new release of pip is available: 23.1.2 -> 24.0

In [4]:
pip install fastnode2vec

Collecting fastnode2vec
  Downloading fastnode2vec-0.0.7-py3-none-any.whl (9.4 kB)
Collecting numba (from fastnode2vec)
  Downloading numba-0.58.1-cp38-cp38-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (3.7 MB)
Collecting llvmlite<0.42,>=0.41.0dev0 (from numba->fastnode2vec)
  Downloading llvmlite-0.41.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (43.6 MB)

In [5]:
import pandas as pd
import numpy as np
from tqdm.auto import tqdm
tqdm.pandas()
from fastnode2vec import Graph, Node2Vec
import pickle

In [6]:
import ast

In [8]:
pip install tqdm


[notice] A new release of pip is available: 23.1.2 -> 24.0
[notice] To update, run: pip install --upgrade pip
Note: you may need to restart the kernel to use updated packages.


In [9]:
from tqdm import tqdm
tqdm.pandas()

Define a series of functions to compute node embeddings and hence a structural similarity score.
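
A small worked example of the score may help. The sketch below computes the cosine similarity of toy two-dimensional vectors, whereas the node2vec embeddings in this notebook have 128 dimensions. The helper name `cosine_similarity_clipped` and the vectors are made up for illustration. Unlike the `cosine_sim` function defined in the next cell, it also clips at -1 and returns NaN for a zero vector, which is an assumption about how one might want to handle those edge cases.

```python
import numpy as np

def cosine_similarity_clipped(u, v):
    """Cosine similarity of two vectors, clipped to [-1, 1].

    Hypothetical variant of the notebook's cosine_sim: it returns NaN
    when either vector has zero length instead of dividing by zero.
    """
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    if denom == 0.0:
        return float("nan")
    # Clipping absorbs tiny floating-point overshoots beyond the valid range
    return float(np.clip(np.dot(u, v) / denom, -1.0, 1.0))

# Toy "embeddings" (made up for illustration)
author_vec = np.array([1.0, 0.0])
same_direction = np.array([2.0, 0.0])   # same direction, different length
orthogonal = np.array([0.0, 3.0])       # at a right angle to author_vec
opposite = np.array([-1.0, 0.0])        # opposite direction

sim_same = cosine_similarity_clipped(author_vec, same_direction)
sim_orthogonal = cosine_similarity_clipped(author_vec, orthogonal)
sim_opposite = cosine_similarity_clipped(author_vec, opposite)
```

The score depends only on the angle between the two vectors, not on their lengths, so users whose embeddings point in similar directions score close to 1.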

In [7]:
def aggressLangProcessing(week_data):
    """
    Accesses the different toxicity categories and creates new columns in the dataframe for each category
    Parameters
    ----------
    week_data : dataframe
        dataframe for a given week of the entire data
    Returns
    ----------
    week_data : dataframe
        dataframe with additional columns for each toxicity category
    """
    # Parse each row's score dictionary once instead of once per category
    parsed = week_data['aggLangDict'].progress_apply(lambda x: ast.literal_eval(str(x)))
    # Map each new column name to its key in the parsed dictionary
    score_columns = {
        'toxicityScore': 'toxicity',
        'identityAttackScore': 'identity_attack',
        'insultScore': 'insult',
        'obsceneScore': 'obscene',
        'severeToxicityScore': 'severe_toxicity',
        'threatScore': 'threat',
    }
    for column, key in score_columns.items():
        week_data[column] = parsed.map(lambda scores, key=key: scores[key])
    return week_data

def cosine_sim(vector1, vector2):
    """
    Computes the cosine similarity score of two vectors
    Parameters
    ----------
    vector1 : node2vec embedding
        embedding of the author
    vector2 : node2vec embedding
        embedding of the receiver
    Returns
    ----------
        cosine similarity of two users
    """
    return min(1., np.dot(vector1, vector2) / (np.linalg.norm(vector1, ord=2) * np.linalg.norm(vector2, ord=2)))

def calculate_author_receiver_counts_with_progress(df):
    """
    Compute the count of every author-receiver pairing
    Parameters
    ----------
    df : dataframe
        comment-level data with 'author' and 'receiver' columns
    Returns
    ----------
        a pandas series (named 'numerator') of every unique author-receiver pair and the count of occurrences in the input dataframe
    """
    groups = df.groupby(['author', 'receiver'])
    counts = {}

    with tqdm(total=len(groups), desc="Calculating author-receiver counts") as pbar:
        for group_key, group_df in groups:
            counts[group_key] = len(group_df)
            pbar.update(1)

    return pd.Series(counts).rename('numerator')

def embeddings_convert_to_numpyarray(embeddings_file):
    """
    Convert node2vec embeddings to a numpy array for ease of computation
    Parameters
    ----------
    embeddings_file : node2vec embeddings
        node2vec embeddings
    Returns
    ----------
        a numpy array 
    """
    num_nodes = len(embeddings_file.key_to_index)
    embedding_dim = embeddings_file.vector_size
    #create a numpy array
    embeddings_array = np.zeros((num_nodes, embedding_dim))
    print(num_nodes)
    print(embedding_dim)
    print(embeddings_array.shape)
    for i, node in enumerate(embeddings_file.key_to_index):
        embeddings_array[i] = embeddings_file[node]
    return embeddings_array

def node_embeddings(week_data, week_no):
    """
    Compute the node embeddings based on a weekly snapshot of interactions on reddit, and then compute the structural similarity measure
    Parameters
    ----------
    week_data : dataframe
        the subset of the reddit data for the given week
    week_no : str
        the week's number (from 1 to 26) as a string, e.g. '18'; it is used to build the output file names
    Returns
    ----------
        None
    """
    #calculate the edge weights: (comments from author to receiver) / (all comments received by that receiver)
    week_data['denominator'] = week_data['receiver'].map(week_data['receiver'].value_counts())
    week_data_1 = calculate_author_receiver_counts_with_progress(week_data)
    week_data_2 = week_data_1.to_frame()
    week_data_2 = week_data_2.reset_index() 
    week_data_2 = week_data_2.rename(columns={"level_0": "author", "level_1": "receiver"})
    week_data_3 = week_data.merge(week_data_2, how='left')
    week_data_3['edge_weights'] = week_data_3['numerator']/week_data_3['denominator']
    print((week_data_3['edge_weights'] <= 1).all())

    #drop duplicates and get a new data set with just unique author-receiver pairs to form the network
    week_data_4 = week_data_3[['author','receiver','edge_weights']]
    graph_df = week_data_4.drop_duplicates(subset = ['author','receiver']).reset_index(drop = True)
    list1 = list(graph_df.itertuples(index=False, name=None))
    graph = Graph(list1, directed=True, weighted=True)
    n2v_q3_week = Node2Vec(graph, dim=128, walk_length=80, window=10, p=1.0, q=3.0, workers=3)
    print('starting the training')
    n2v_q3_week.train(epochs=25)
    model_storage_name_location = 'n2v_models/n2vmodel_week' + str(week_no) + '.pkl'
    # Open the file in binary write mode
    with open(model_storage_name_location, 'wb') as f:
        pickle.dump(n2v_q3_week.wv, f)
    embeddings_array = embeddings_convert_to_numpyarray(n2v_q3_week.wv)
    print(type(embeddings_array))
    #code to save the embeddings array itself.
    #with open("embeddings.pickle", "wb") as f:
        # Pickle the embeddings using the highest protocol version (optional)
        #pickle.dump(embeddings_array, f, pickle.HIGHEST_PROTOCOL)
    week_data['networkSimilarity'] = week_data.progress_apply(lambda x: cosine_sim(n2v_q3_week.wv[x.author], n2v_q3_week.wv[x.receiver]), axis=1)
    week_data1 = week_data[['id', 'subreddit', 'body', 'author', 'parent_id', 'link_id', 'receiver',
       'receiver_body', 'date', 'toxicityScore',
       'identityAttackScore', 'insultScore', 'obsceneScore',
       'severeToxicityScore', 'threatScore', 'week', 'networkSimilarity']]
    resulting_file_name = 'netSim_processed/mayjune_w' + str(week_no) + '_processed.parquet'
    week_data1.to_parquet(resulting_file_name)

    print('completed!')
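
The edge-weight step inside `node_embeddings` can be checked on a toy example. The sketch below is a vectorized equivalent on a made-up dataframe, using `groupby(...).size()` in place of the explicit counting loop; it is not the notebook's code path. Each weight is the number of comments from an author to a receiver divided by all comments that receiver got, so the incoming weights of every receiver should sum to 1.

```python
import pandas as pd

# Toy interactions (made up): each row is one comment from author to receiver
comments = pd.DataFrame({
    "author":   ["a", "a", "b", "c", "c"],
    "receiver": ["b", "b", "a", "b", "a"],
})

# Numerator: number of comments for each unique (author, receiver) pair
pair_counts = (
    comments.groupby(["author", "receiver"])
    .size()
    .rename("numerator")
    .reset_index()
)

# Denominator: total number of comments each receiver got from anyone
received = comments["receiver"].value_counts().rename("denominator")

# Attach the receiver totals to each pair and compute the weight
edges = pair_counts.join(received, on="receiver")
edges["edge_weights"] = edges["numerator"] / edges["denominator"]

# (author, receiver, weight) tuples, the same shape as `list1` in node_embeddings
edge_list = list(
    edges[["author", "receiver", "edge_weights"]].itertuples(index=False, name=None)
)

# Sanity check: incoming weights per receiver add up to 1
incoming_totals = edges.groupby("receiver")["edge_weights"].sum()
```

On large weekly snapshots, the vectorized count avoids iterating over every group in Python, at the cost of losing the progress bar.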

In [None]:
###########modify the following


#node_embeddings('janfeb_w2','2')
#read the processed file with network similarity
#file_name1 = 'janfeb_w3'
#file_path1 = 'JanFebSubreddits2022/weekly_data/' + file_name1 + '.parquet'
#week_data2 = pd.read_parquet(file_path1)
#week_data2.head(2)

#print(len(week_data2))

#read the node2vec model
#file_name1 = 'n2vmodel_week3'
#file_path1 = 'JanFebSubreddits2022/weekly_data/' + file_name1 + '.pkl'
#read_model = pd.read_pickle(file_path1)
#read_model['automoderator']

To illustrate, the functions defined above are called below on the data for a single week.

## Process Week 18

In [None]:
w18_part1 = pd.read_parquet('weekly_data/MayJune_w18_part1')
w18_part2 = pd.read_parquet('weekly_data/MayJune_w18_part2')
w18 = pd.concat([w18_part1,w18_part2], ignore_index=True, axis=0)
print(len(w18))
w18.head(2)

In [None]:
w18 = aggressLangProcessing(w18)
print(len(w18))
w18.head(2)

In [12]:
w18['subreddit'].value_counts()

In [13]:
w18['date'].value_counts()

In [14]:
node_embeddings(w18,'18')

True
starting the training
248641
128
(248641, 128)
<class 'numpy.ndarray'>
completed!


Calculating author-receiver counts:   0%|          | 0/887992 [00:00<?, ?it/s]

In [1]:
#read the processed file with network similarity
#week_data18 = pd.read_parquet('netSim_processed/mayjune_w18_processed.parquet')
#week_data18.head(2)