# **Creating the NetworkX graph for Fifteen subreddits, Generating the Node Embeddings, and Calculating the Network Similarity**

This notebook considers 15 subreddits and generates the NetworkX graphs.

The interaction between user 'i' and user 'j' is captured by the following metric:

*(Number of comments between i and j)/(Number of comments sent by all users to j)*
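
As a minimal sketch of the metric, assume a toy list of `(sender, receiver)` reply pairs; the user names are hypothetical and not taken from the data. The weight of the edge from i to j is the share of all replies received by j that were written by i. The notebook's functions take these counts per parent comment, while this sketch counts per receiving user for brevity.

```python
from collections import Counter

# Hypothetical reply events: (sender, receiver).
replies = [("i", "j"), ("i", "j"), ("k", "j"), ("j", "i")]

pair_counts = Counter(replies)                    # comments sent from one user to another
received_counts = Counter(r for _, r in replies)  # comments sent by all users to each receiver

edge_weight = {
    (s, r): n / received_counts[r] for (s, r), n in pair_counts.items()
}
print(edge_weight[("i", "j")])  # 2 of the 3 replies to j came from i
```

The weights of all edges pointing at the same receiver sum to 1 under this definition.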

---

**Part 1: Reading the data**<br>
**Part 2: Generate the Graph using Method 2a**<br>
  The unit level is a submission. The metric is aggregated across all submissions from all subreddits<br>
  Output file generated: 'data_fifteen_subreddits_method2a_node2vec.csv'<br>
**Part 3: Generate the Graph using Method 2b**<br>
  The unit level is a subreddit. The metric is aggregated across all subreddits<br>
  Output file generated: 'data_fifteen_subreddits_method2b_node2vec.csv'<br>
**Part 4: Generate the Graph using Method 2c**<br>
  The unit level is the entire reddit data<br>
  Output file generated: 'data_fifteen_subreddits_method2c_node2vec.csv'<br>

### note: The column 'cosine_similarity' in all output files is actually the 'Network Similarity'
---

OUTPUT FILE:<br>
1. 'data_fifteen_subreddits_method2a_node2vec.csv': contains the unique user pairs and their network similarity (cosine similarity between node embeddings)
2. 'data_fifteen_subreddits_method2b_node2vec.csv': contains the unique user pairs and their network similarity (cosine similarity between node embeddings)
3. 'data_fifteen_subreddits_method2c_node2vec.csv': contains the unique user pairs and their network similarity (cosine similarity between node embeddings)
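
Since the `cosine_similarity` column holds the network similarity, a reader of the output files may want to rename it on load. The sketch below assumes a two-row in-memory stand-in for one of the files; the rows and the new column name `network_similarity` are hypothetical.

```python
import io
import pandas as pd

# Hypothetical two-row stand-in for one of the output files.
csv_text = "from_user,to_user,cosine_similarity\nu1,u2,0.82\nu3,u4,0.93\n"

df = pd.read_csv(io.StringIO(csv_text))
df = df.rename(columns={"cosine_similarity": "network_similarity"})
print(list(df.columns))
```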

---
# **Part 1: Reading the data**

This section reads the data for the 15 subreddits.

---

Check if cuda is being used

In [None]:
import torch
if torch.cuda.is_available():
    device_name = torch.device("cuda")
else:
    device_name = torch.device('cpu')
print("Using {}.".format(device_name))

Using cuda.


Connect to drive

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


Read the data containing fifteen subreddits

In [None]:
import pandas as pd
data_fifteen_subreddits = pd.read_csv('/content/gdrive/MyDrive/Colab Notebooks/reddit_project/data_fifteen_subreddits.csv', low_memory=False)
print(len(data_fifteen_subreddits)) #number of rows = 107352
print(len(pd.unique(data_fifteen_subreddits['subreddit_id']))) #distinct subreddit_id values = 16, although only 15 subreddits are considered
print(len(pd.unique(data_fifteen_subreddits['id']))) #number of unique comments = 107351; the data is at the comment level
print(len(pd.unique(data_fifteen_subreddits['parent_id']))) #number of distinct parents = 50325
print(len(pd.unique(data_fifteen_subreddits['link_id']))) #number of submissions = 6156
print(len(data_fifteen_subreddits.columns))

107352
16
107351
50325
6156
17
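
One possible explanation for the count of 16, which this notebook does not confirm, is a missing value in `subreddit_id`: `pd.unique` treats NaN as a distinct value, while `Series.nunique()` ignores it by default. The sketch below shows the difference on a hypothetical column.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing value.
s = pd.Series(["t5_a", "t5_b", np.nan, "t5_a"])

print(len(pd.unique(s)))  # counts NaN as a distinct value
print(s.nunique())        # ignores NaN by default
print(s.isna().sum())     # number of missing entries
```

Running the same three calls on `data_fifteen_subreddits['subreddit_id']` would show whether a missing entry accounts for the extra value.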


---
# **Part 2: Generate the Graph using Method 2a**<br>

---


  The unit level is a submission. The metric is aggregated across all submissions from all subreddits

Create an empty dataframe to store the edge weights generated using method 2a

In [None]:
data_method2a = pd.DataFrame(columns=['subreddit_id','submission_id','from_user','to_user','edgeweight_method2a'])
print(data_method2a)

Empty DataFrame
Columns: [subreddit_id, submission_id, from_user, to_user, edgeweight_method2a]
Index: []


If the input contains more than one subreddit, pass an empty string ("") as the 'subreddit_id' parameter.

In [None]:
def method2a_function(subreddit_id, subreddit_data):
  data_method2a = pd.DataFrame(columns=['subreddit_id','submission_id','from_user','to_user','edgeweight_method2a'])
  #print(data_method2a)

  #obtain the link_id's of all submission in the subreddit
  submissions_list = subreddit_data['link_id'].unique()
  #obtain the names of all user's in the subreddit
  #author_list = subreddit_data['author'].unique()
  #print(author_list[0])
  #print(author_list[1])

  if subreddit_id != "":
    print("Consider subreddit with ID: ",subreddit_id)
  print("total number of submission: ",len(submissions_list))
  print("total number of comments in entire subreddit: ",len(subreddit_data['id'].unique()))
  counter = 0
  ignore_comments_counter = 0

  #iterate across all submissions
  for i in submissions_list:
    print('\n***********************')
    counter += 1
    curr_link_id = i
    print("consider submission with link_id = "+str(curr_link_id)+ " ("+str(counter)+"/"+str(len(submissions_list))+")")

    #obtain all comments made on this submission from the subreddit
    submission_comments = subreddit_data.loc[subreddit_data['link_id'] == curr_link_id]
    #print(submission_comments.head(3))

    #obtain the names of all users in the submission
    author_list = submission_comments['author'].unique()
    print("total number of authors: ",len(author_list))

    #number of comments in the submission
    tot_comments = len(submission_comments)
    print("total number of comments in this submission = ", tot_comments)
    j = 0

    #iterate across the current submission
    for index, row in submission_comments.iterrows():
      j += 1
      #print('comment '+str(j)+" ---------------")

      curr_author = row['author']
      curr_parent_comment_id = row['parent_id'][3:] #noticed that the parent id is nothing but the comment id preceded by 3 characters
      #print(curr_parent_comment_id)

      #identify if a comment is a primary comment
      primary_comment_flag = 0
      if row['parent_id'] == row['link_id']: #it is a primary comment
        #print("primary comment")
        primary_comment_flag = 1

      #rule out any comment that does not have a valid parent in the submission
      #(the parent id of a given comment should be either (i) the link id itself for a primary comment
      # or
      # (ii) the id of another comment in the submission)
      if(len(submission_comments[submission_comments['id'] == curr_parent_comment_id]['author']) == 0): #the parent comment could not be found
        #print(primary_comment_flag)
        #if (primary_comment_flag == 1):
          #print('this comment is a reply to the submission and hence will not have an edge')
        #else:
          #print('this comment was made as a reply to another comment which cannot be found in the data')
        #print('there')
        ignore_comments_counter += 1
        continue

      primary_comment_flag = 0
      #print('Found a valid parent comment in the submission')
      curr_parent = submission_comments[submission_comments['id'] == curr_parent_comment_id]['author'].values[0]
      #print('The parent comment was made by author: '+str(curr_parent))
      curr_comment_id = row['id']

      all_replies_to_parent_df = submission_comments[(submission_comments["parent_id"] == row['parent_id'])]
      if len(all_replies_to_parent_df) == 0:
        #print('reached here 1')
        ignore_comments_counter += 1
        continue
      curr_author_all_replies_to_parent_df = all_replies_to_parent_df[(all_replies_to_parent_df["author"] == curr_author)]
      if len(curr_author_all_replies_to_parent_df) == 0:
        #print('reached here 2')
        ignore_comments_counter += 1
        continue
      else:
        #print("edge weight for this comment: "+str(len(curr_author_all_replies_to_parent_df))+"/"+str(len(all_replies_to_parent_df)))
        #weighted_interaction_between_curr_and_curr_parent = float("{:.2f}".format((len(curr_author_all_replies_to_parent_df)/len(all_replies_to_parent_df))))
        weighted_interaction_between_curr_and_curr_parent = (len(curr_author_all_replies_to_parent_df)/len(all_replies_to_parent_df))
        #add the row to the dataframe
        if len(data_method2a[(data_method2a['from_user'] == curr_author) & (data_method2a['to_user'] == curr_parent) & (data_method2a['submission_id'] == curr_link_id)].values) > 0: #a row from curr_author to curr_parent already exists for this submission
          ignore_comments_counter += 1
          continue
        else:
          subreddit_id = row['subreddit_id']
          data_method2a.loc[len(data_method2a.index)] = [subreddit_id, curr_link_id, curr_author, curr_parent, weighted_interaction_between_curr_and_curr_parent]
          #print('length of submission data frame: '+str(len(data_method2a)))

    #data_method2a.append(data_method2a_local,ignore_index=True)
    #print(data_method2a_local)
    #print('appended, new length of data frame: '+str(len(data_method2a)))
  print('total number of comments ignored: ' +str(ignore_comments_counter))
  return data_method2a


data_method2a = method2a_function("",data_fifteen_subreddits)


Streaming output truncated to the last 5000 lines.
***********************
consider submission with link_id = t3_5llk77 (5157/6156)
total number of authors:  3
total number of comments in this submission =  3

***********************
consider submission with link_id = t3_5llr87 (5158/6156)
total number of authors:  6
total number of comments in this submission =  6

***********************
consider submission with link_id = t3_5llsah (5159/6156)
total number of authors:  55
total number of comments in this submission =  83

***********************
consider submission with link_id = t3_5lllyx (5160/6156)
total number of authors:  3
total number of comments in this submission =  3

***********************
consider submission with link_id = t3_5llm77 (5161/6156)
total number of authors:  4
total number of comments in this submission =  6

***********************
consider submission with link_id = t3_5llsmq (5162/6156)
total number of authors:  11
total number of comments in 

In [None]:
print(len(data_method2a))
print(data_method2a.head(2))

54469
  subreddit_id submission_id      from_user     to_user  edgeweight_method2a
0      t5_22i0     t3_5lc6zb         CR1986     Dukelix                  1.0
1      t5_22i0     t3_5lc6zb  NinjaPizzaCat  seewolfmdk                  1.0
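
The per-comment loop in `method2a_function` can, under some assumptions, be expressed with pandas merges and group-wise counts. The sketch below is a hypothetical variant, not the notebook's method, and has not been checked against the full output above. It assumes the same columns (`link_id`, `id`, `parent_id`, `author`) and keeps the first row for each (submission, sender, receiver) triple, as the loop does. The function name and the toy data are made up for illustration.

```python
import pandas as pd

def edge_weights_per_submission(comments: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical vectorized variant of the per-comment loop."""
    df = comments.copy()
    # parent_id carries a 3-character type prefix (e.g. "t1_") before the comment id.
    df["parent_comment_id"] = df["parent_id"].str[3:]

    # Lookup table: (submission, comment id) -> author of that comment.
    parents = (
        df[["link_id", "id", "author"]]
        .rename(columns={"id": "parent_comment_id", "author": "to_user"})
        .drop_duplicates(["link_id", "parent_comment_id"])
    )

    # Keep only replies whose parent comment exists in the same submission.
    replies = df.merge(parents, on=["link_id", "parent_comment_id"], how="inner")

    # Denominator: all replies to the parent; numerator: replies by this author.
    by_parent = replies.groupby(["link_id", "parent_id"])["id"]
    by_author = replies.groupby(["link_id", "parent_id", "author"])["id"]
    replies["edgeweight"] = by_author.transform("count") / by_parent.transform("count")

    return (
        replies.rename(columns={"link_id": "submission_id", "author": "from_user"})
        .drop_duplicates(["submission_id", "from_user", "to_user"])
        [["submission_id", "from_user", "to_user", "edgeweight"]]
        .reset_index(drop=True)
    )

# Toy submission: bob posts a top-level comment, alice replies twice, carol once.
toy = pd.DataFrame({
    "link_id":   ["t3_s", "t3_s", "t3_s", "t3_s"],
    "id":        ["c1", "c2", "c3", "c4"],
    "parent_id": ["t3_s", "t1_c1", "t1_c1", "t1_c1"],
    "author":    ["bob", "alice", "alice", "carol"],
})
edges = edge_weights_per_submission(toy)
print(edges)
```

The top-level comment produces no edge because its parent is the submission itself, which matches the loop's treatment of primary comments.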


Create another empty dataframe to aggregate repeated user pairs across submissions

In [None]:
data_method2a_final = pd.DataFrame(columns=['from_user','to_user','edgeweight_method2a'])
print(data_method2a_final)

Empty DataFrame
Columns: [from_user, to_user, edgeweight_method2a]
Index: []


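Group the edge weights by submission, sender, and receiver and take the mean. Because `submission_id` is part of the grouping key, user pairs that repeat across submissions stay on separate rows, so the row count remains 54469.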

In [None]:
data_method2a_final = data_method2a.groupby(['submission_id','from_user', 'to_user'], as_index=False)['edgeweight_method2a'].mean()

In [None]:
print(len(data_method2a_final))
print(data_method2a_final.columns)
print(data_method2a_final.head(3))

54469
Index(['submission_id', 'from_user', 'to_user', 'edgeweight_method2a'], dtype='object')
  submission_id     from_user          to_user  edgeweight_method2a
0     t3_4y2tcs   Harden-Soul          cabbeer                  1.0
1     t3_4y779v  General_Fear  the_strasburger                  1.0
2     t3_4ys9yd        ashaw7    victor_knight                  1.0
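
In the cell above, `submission_id` is part of the grouping key, so each user pair still has one row per submission. If the intent is one row per directed user pair, one option is to group by sender and receiver only. This is a sketch under that assumption with hypothetical toy rows; it is not what the notebook runs.

```python
import pandas as pd

# Hypothetical per-submission edge weights; u1 -> u2 appears in two submissions.
per_submission = pd.DataFrame({
    "submission_id": ["t3_a", "t3_b", "t3_a"],
    "from_user": ["u1", "u1", "u3"],
    "to_user": ["u2", "u2", "u2"],
    "edgeweight_method2a": [1.0, 0.5, 0.5],
})

# Grouping without submission_id yields one row per directed user pair.
per_pair = per_submission.groupby(
    ["from_user", "to_user"], as_index=False
)["edgeweight_method2a"].mean()
print(per_pair)
```

Here the two `u1 -> u2` rows collapse into a single row with the mean weight 0.75.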


In [None]:
import networkx as nx
import numpy as np
import matplotlib.pyplot as plt

In [None]:
G_method2a = nx.from_pandas_edgelist(data_method2a_final, "from_user", "to_user", edge_attr="edgeweight_method2a", create_using=nx.DiGraph()) #edge weights are stored under 'edgeweight_method2a', not under the default 'weight' key

The code below visualizes the graph and takes a while to run

In [None]:
from matplotlib.pyplot import text
plt.figure(figsize=(8, 8))
pos = nx.spring_layout(G_method2a, k=0.61)  # k sets the preferred distance between nodes
d = dict(G_method2a.degree)
labels = {e: G_method2a.edges[e]['edgeweight_method2a'] for e in G_method2a.edges}
#nx.draw(G_method2a, pos,with_labels=True)
#nx.draw(G_method2a, pos)
nx.draw_networkx_edges(G_method2a, pos, alpha=0.8)
nx.draw_networkx_nodes(G_method2a, pos, node_size=2, node_color="b")
#nx.draw_networkx_edge_labels(G_method2a, pos, edge_labels=labels, font_size = 5)
#nx.draw_networkx_labels(G_method2a, pos, labels=labels, font_size = 5)

plt.show()

In [None]:
print(G_method2a.number_of_nodes())
print(G_method2a.number_of_edges()) #number of edges; fewer than the 54469 rows because a DiGraph keeps one edge per ordered user pair
print(np.mean([d for _, d in G_method2a.degree()])) #average degree of nodes
print(G_method2a.size(weight='edgeweight_method2a'))

28264
53326
3.773422020945372
33540.93378498311
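
The average degree printed above can be checked by hand. In a directed graph, `degree` counts in-edges plus out-edges, so each edge contributes 2 to the total degree. The sketch below reproduces the figure from the node and edge counts.

```python
nodes = 28264
edges = 53326

# Each directed edge adds one out-degree and one in-degree,
# so the total degree over all nodes is 2 * edges.
average_degree = 2 * edges / nodes
print(average_degree)
```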


In [None]:
pip install node2vec

Collecting node2vec
  Downloading node2vec-0.4.6-py3-none-any.whl (7.0 kB)
Collecting networkx<3.0,>=2.5 (from node2vec)
  Downloading networkx-2.8.8-py3-none-any.whl (2.0 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.0/2.0 MB 15.1 MB/s eta 0:00:00
Installing collected packages: networkx, node2vec
  Attempting uninstall: networkx
    Found existing installation: networkx 3.1
    Uninstalling networkx-3.1:
      Successfully uninstalled networkx-3.1
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
lida 0.0.10 requires fastapi, which is not installed.
lida 0.0.10 requires kaleido, which is not installed.
lida 0.0.10 requires python-multipart, which is not installed.
lida 0.0.10 requires uvicorn, which is not installed.
Successfully installed networkx-2.8.8 node2vec-0.4.6


In [None]:
from node2vec import Node2Vec as n2v

Takes about 5 min to run on a V100 GPU

In [None]:
WINDOW = 3 # Node2Vec fit window
MIN_COUNT = 1 # Node2Vec min. count
BATCH_WORDS = 4 # Node2Vec batch words

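g_emb = n2v(G_method2a,dimensions=10) # dimensions=10 is overridden by vector_size=128 in fit() below; the embeddings printed later have length 128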

mdl2 = g_emb.fit(
    vector_size = 128,
    window=WINDOW,
    min_count=MIN_COUNT,
    batch_words=BATCH_WORDS
)

Computing transition probabilities:   0%|          | 0/28264 [00:00<?, ?it/s]

Generating walks (CPU: 1): 100%|██████████| 10/10 [00:48<00:00,  4.89s/it]
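
Node2Vec first samples random walks over the graph and then trains a skip-gram model on them, as the two progress bars above suggest. The sketch below is a simplified stand-in, not the library's implementation: it samples first-order weighted walks on a toy adjacency dictionary, which corresponds to node2vec with return and in-out parameters p = q = 1. The graph, the `random_walk` helper, and the seed are hypothetical.

```python
import random

# Hypothetical directed toy graph: node -> {neighbor: edge weight}.
graph = {
    "a": {"b": 1.0, "c": 0.5},
    "b": {"c": 1.0},
    "c": {"a": 1.0},
}

def random_walk(graph, start, walk_length, rng):
    """Sample one walk, choosing each next node in proportion to edge weight."""
    walk = [start]
    while len(walk) < walk_length:
        neighbors = graph.get(walk[-1], {})
        if not neighbors:  # dead end: stop early
            break
        nodes = list(neighbors)
        weights = [neighbors[n] for n in nodes]
        walk.append(rng.choices(nodes, weights=weights, k=1)[0])
    return walk

rng = random.Random(0)
walks = [random_walk(graph, node, walk_length=5, rng=rng) for node in graph]
for walk in walks:
    print(walk)
```

Each walk is then treated like a sentence whose words are node names, which is why the trained model is accessed through `wv` as in a word-embedding model.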


information about generated embeddings

In [None]:
print(len(mdl2.wv)) #print the length of the embeddings generated = 28264 which is the number of nodes
print(len(mdl2.wv[0])) #length of each embedding is 128
print(mdl2.wv[0]) #print the first embedding

28264
128
[-0.33822307  0.1810318  -0.26447213  0.89019746 -0.5043572  -0.21629845
 -0.45103353  0.02191189  0.4581968  -0.30041605  0.3469439   0.02185048
 -0.16115557  0.15827274 -0.05883258 -0.00856226 -0.29456195 -0.23424293
  0.10861116 -0.25548974 -0.05279021 -0.10582004 -0.22255486 -0.1468608
  0.14261645 -0.10422887  0.1128809  -0.21484789 -0.79538417 -0.05473675
  0.49935836 -0.5266371   0.25418717  0.2685348   0.5183924  -0.04847993
 -0.28836653  0.03186553 -0.12763806  0.07498194 -0.5010787   0.04601716
 -0.141891    0.2917946   0.36924398 -0.5282059  -0.13543415 -0.05546651
  0.21424343  0.33141237  0.2698647  -0.142909    0.07132612  0.22982118
  0.15080106  0.5808087   0.2050143  -0.15507324  0.31838143 -0.19453602
 -0.07318855  0.33704028  0.03219314  0.10914564  0.70410895 -0.13480465
 -0.06276549  0.02874995 -0.07599476  0.00628846  0.4143304   0.43002883
  0.51814014 -0.24169785  0.53005564 -0.51944834 -0.04543479  0.25571266
  0.10792958 -0.28635833  0.1389921  -0.10

print the node embedding for a given node

In [None]:
mdl2.wv['Madjura']

array([-0.08505617,  0.04920644,  0.06581379,  0.28907484, -0.41281283,
       -0.15990771, -0.19128779, -0.13824372,  0.06693247, -0.06882457,
        0.13493234,  0.08240538, -0.12780212, -0.0234277 ,  0.266124  ,
        0.05629351, -0.15513372, -0.10359576, -0.18863468,  0.03245437,
       -0.24512553, -0.06331603, -0.09945363,  0.08095748,  0.09669771,
       -0.06099385,  0.1392944 , -0.1842385 , -0.12451602,  0.08805975,
        0.18567193, -0.21253149, -0.00224559,  0.08847424,  0.18661666,
        0.16585422,  0.0138135 , -0.03930105, -0.05086125, -0.06706466,
       -0.42813912, -0.27818772, -0.07309095,  0.12809572,  0.15994684,
       -0.29039183,  0.16346125,  0.29525712,  0.32718462,  0.2452815 ,
        0.00265643,  0.04248144, -0.01006392, -0.0965393 , -0.06642368,
        0.11685839,  0.12831707, -0.0312924 ,  0.09335306,  0.01371595,
       -0.10358334,  0.3208729 ,  0.16143644, -0.4167219 ,  0.2586411 ,
        0.19273193, -0.16883975, -0.27366075,  0.17496695, -0.09

function to find cosine similarity

In [None]:
import numpy as np
def cosine_sim(vector1, vector2):
    return min(1., np.dot(vector1, vector2) / (np.linalg.norm(vector1, ord=2) * np.linalg.norm(vector2, ord=2)))
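
The `cosine_sim` function above caps the result at 1 only. As a sketch, the hypothetical variant below clips on both sides and is applied to three small vectors whose similarities are easy to verify by hand.

```python
import numpy as np

def cosine_sim_clipped(v1, v2):
    """Cosine similarity clipped to [-1, 1] to absorb floating-point overshoot."""
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return float(np.clip(cos, -1.0, 1.0))

a = np.array([1.0, 0.0])
b = np.array([0.0, 2.0])
c = np.array([3.0, 0.0])

print(cosine_sim_clipped(a, b))   # orthogonal vectors
print(cosine_sim_clipped(a, c))   # same direction
print(cosine_sim_clipped(a, -c))  # opposite direction
```

The vector lengths do not matter; only the angle between the vectors does.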

In [None]:
x = mdl2.wv['Madjura']
y = mdl2.wv['waboz']
print(cosine_sim(x, y))

0.85531545


Print the nodes whose embeddings are most similar to a given node

In [None]:
comment_id = 'Madjura'
for s in mdl2.wv.most_similar(comment_id, topn = 10):
    print(s)

('waboz', 0.8553155064582825)
('BrennanBr', 0.8514754176139832)
('TDSquared', 0.8452527523040771)
('fancythenancy', 0.8352042436599731)
('Swerdman55', 0.8321106433868408)
('iwannabuyit', 0.8287559151649475)
('Uncle_Sams_Cabin', 0.8283258080482483)
('cheekymonkey2005', 0.8239148259162903)
('RoosterCoops', 0.8230941295623779)
('itsallabigshow', 0.8229793310165405)


Store the similarities computed from the embeddings

Create a new dataframe to store the cosine similarity between the sender node and the receiver node. Because Method 2a works at the submission level, the submission id of each user pair is available and is also included in the dataframe.

In [None]:
data_method2a_node2vec = pd.DataFrame(columns=['submission_id','from_user','to_user','edgeweight_method2a','cosine_similarity'])
print(data_method2a_node2vec)

Empty DataFrame
Columns: [submission_id, from_user, to_user, edgeweight_method2a, cosine_similarity]
Index: []


In [None]:
i = 0
print('generating cosine similarity for '+str(len(data_method2a_final))+' user pairs'+'\n')

for ind, row in data_method2a_final.iterrows():

  i += 1
  if i % 1000 == 0:
    print('finished '+ str(i)+'/'+str(len(data_method2a_final))+' user pairs')

  #get the user pair and calculate the cosine similarity between their node embeddings
  from_user = row['from_user']
  to_user = row['to_user']
  from_user_embedding = mdl2.wv[from_user]
  to_user_embedding = mdl2.wv[to_user]

  cos_sim_between_from_to_user = cosine_sim(from_user_embedding, to_user_embedding)

  curr_submission_id = row['submission_id']
  curr_edge_weight = row['edgeweight_method2a']

  data_method2a_node2vec.loc[len(data_method2a_node2vec.index)] = [curr_submission_id, from_user, to_user, curr_edge_weight, cos_sim_between_from_to_user]


generating cosine similarity for 54469 user pairs

finished 1000/54469 user pairs
finished 2000/54469 user pairs
finished 3000/54469 user pairs
finished 4000/54469 user pairs
finished 5000/54469 user pairs
finished 6000/54469 user pairs
finished 7000/54469 user pairs
finished 8000/54469 user pairs
finished 9000/54469 user pairs
finished 10000/54469 user pairs
finished 11000/54469 user pairs
finished 12000/54469 user pairs
finished 13000/54469 user pairs
finished 14000/54469 user pairs
finished 15000/54469 user pairs
finished 16000/54469 user pairs
finished 17000/54469 user pairs
finished 18000/54469 user pairs
finished 19000/54469 user pairs
finished 20000/54469 user pairs
finished 21000/54469 user pairs
finished 22000/54469 user pairs
finished 23000/54469 user pairs
finished 24000/54469 user pairs
finished 25000/54469 user pairs
finished 26000/54469 user pairs
finished 27000/54469 user pairs
finished 28000/54469 user pairs
finished 29000/54469 user pairs
finished 30000/54469 user pair
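
Appending to a dataframe one row at a time with `.loc[len(df)]` can be slow for tens of thousands of pairs. A common alternative is to collect the rows in a list and build the dataframe once. The sketch below assumes a plain dictionary as a stand-in for `mdl2.wv` and a two-row stand-in for `data_method2a_final`; all names and values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for mdl2.wv and data_method2a_final.
embeddings = {
    "u1": np.array([1.0, 0.0]),
    "u2": np.array([1.0, 1.0]),
    "u3": np.array([0.0, 1.0]),
}
pairs = pd.DataFrame({
    "from_user": ["u1", "u3"],
    "to_user": ["u2", "u2"],
    "edgeweight": [1.0, 0.5],
})

def cosine(v1, v2):
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))

# Collect plain dicts and build the DataFrame once at the end.
rows = []
for row in pairs.itertuples(index=False):
    sim = cosine(embeddings[row.from_user], embeddings[row.to_user])
    rows.append({**row._asdict(), "cosine_similarity": sim})

result = pd.DataFrame(rows)
print(result)
```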

In [None]:
print(len(data_method2a_node2vec))
print(data_method2a_node2vec.columns)
print(data_method2a_node2vec.head(3))

54469
Index(['submission_id', 'from_user', 'to_user', 'edgeweight_method2a',
       'cosine_similarity'],
      dtype='object')
  submission_id     from_user          to_user  edgeweight_method2a  \
0     t3_4y2tcs   Harden-Soul          cabbeer                  1.0   
1     t3_4y779v  General_Fear  the_strasburger                  1.0   
2     t3_4ys9yd        ashaw7    victor_knight                  1.0   

   cosine_similarity  
0           0.824062  
1           0.931396  
2           0.796708  


In [None]:
data_method2a_node2vec.to_csv('/content/gdrive/MyDrive/Colab Notebooks/reddit_project/data_fifteen_subreddits_method2a_node2vec.csv',index=False)

---
# **Part 3: Generate the Graph using Method 2b**<br>

---


 The unit level is a subreddit. The metric is aggregated across all subreddits

Create an empty dataframe to store the edge weights generated using method 2b

In [None]:
data_method2b = pd.DataFrame(columns=['subreddit_id','submission_id','from_user','to_user','num', 'denom'])
print(data_method2b)

Empty DataFrame
Columns: [subreddit_id, submission_id, from_user, to_user, num, denom]
Index: []


If the input contains more than one subreddit, pass an empty string ("") as the 'subreddit_id' parameter.

In [None]:
def method2b_function(subreddit_id, subreddit_data):
  submissions_list = subreddit_data['link_id'].unique()

  if subreddit_id != "":
    print("Consider subreddit with ID: ",subreddit_id)
  print("total number of submission: ",len(submissions_list))
  counter = 0
  ignore_comments_counter =  0

  #record the type of a valid parent_id so rows with another type can be skipped below
  type_base = type(subreddit_data['parent_id'].iloc[0])

  #total number of comments in the input data
  tot_comments = len(subreddit_data)
  print("total number of comments in this submission = ", tot_comments)

  i = 0

  #iterate across all comments in the input data
  for index, row in subreddit_data.iterrows():

      i += 1
      if i % 10000 == 0:
        print('comment '+str(i)+'/'+str(len(subreddit_data)))

      curr_link_id = row['link_id']
      curr_author = row['author']
      curr_subreddit = row['subreddit_id']
      if type(row['parent_id']) != type_base:
        continue
      curr_parent_comment_id = row['parent_id'][3:] #noticed that the parent id is nothing but the comment id preceded by 3 characters


      if(len(subreddit_data[subreddit_data['id'] == curr_parent_comment_id]['author']) == 0): #the parent comment could not be found
        ignore_comments_counter += 1
        continue

      #print('Found a valid parent comment in the submission')
      curr_parent = subreddit_data[subreddit_data['id'] == curr_parent_comment_id]['author'].values[0]

      all_replies_to_parent_df = subreddit_data[(subreddit_data["parent_id"] == row['parent_id'])]
      if len(all_replies_to_parent_df) == 0:
        ignore_comments_counter += 1
        continue
      curr_author_all_replies_to_parent_df = all_replies_to_parent_df[(all_replies_to_parent_df["author"] == curr_author)]
      if len(curr_author_all_replies_to_parent_df) == 0:
        ignore_comments_counter += 1
        continue
      else:
        if len(data_method2b[(data_method2b['from_user'] == curr_author)&(data_method2b['to_user'] == curr_parent)& (data_method2b['subreddit_id'] == curr_subreddit)].values) > 0: #a row from curr_author to curr_parent already exists for this subreddit
          ignore_comments_counter += 1
          continue
        else:
          subreddit_id = row['subreddit_id']
          data_method2b.loc[len(data_method2b.index)] = [subreddit_id, curr_link_id, curr_author, curr_parent,len(curr_author_all_replies_to_parent_df), len(all_replies_to_parent_df) ]

  print('total number of comments ignored: ' +str(ignore_comments_counter))
  return data_method2b

data_method2b = method2b_function("",data_fifteen_subreddits)


total number of submission:  6156
total number of comments in this submission =  107352
comment 10000/107352
comment 20000/107352
comment 30000/107352
comment 40000/107352
comment 50000/107352
comment 60000/107352
comment 70000/107352
comment 80000/107352
comment 90000/107352
comment 100000/107352
total number of comments ignored: 44377
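
`method2b_function` filters the whole dataframe once per comment to find the parent's author. A dictionary built once from `id` to `author` gives the same lookup in constant time per comment. The sketch below uses a hypothetical three-row comment table; it is an illustration of the lookup only, not a replacement for the full function.

```python
import pandas as pd

# Hypothetical comment table with the same columns the function reads.
comments = pd.DataFrame({
    "id": ["c1", "c2", "c3"],
    "parent_id": ["t3_s", "t1_c1", "t1_zz"],
    "author": ["bob", "alice", "carol"],
})

# Build the id -> author mapping once instead of filtering the frame per row.
# If an id repeats, the last author wins here.
author_by_id = dict(zip(comments["id"], comments["author"]))

found = {}
for comment_id, parent_id, author in comments.itertuples(index=False):
    parent_author = author_by_id.get(parent_id[3:])  # None if the parent is missing
    found[comment_id] = parent_author
    print(comment_id, author, "->", parent_author)
```

Comments whose lookup returns `None` correspond to the cases the function counts in `ignore_comments_counter`.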


In [None]:
data_method2b_final = pd.DataFrame(columns=['from_user','to_user','num','denom'])
print(data_method2b_final)

Empty DataFrame
Columns: [from_user, to_user, num, denom]
Index: []


Create another dataframe that sums 'num' and 'denom' for each user pair within each subreddit

In [None]:
data_method2b_final = data_method2b.groupby(['subreddit_id','from_user', 'to_user'], as_index=False).agg({'num':'sum','denom':'sum'})

In [None]:
print(len(data_method2b_final))
print(data_method2b_final.columns)
print(data_method2b_final.head(140))

53367
Index(['subreddit_id', 'from_user', 'to_user', 'num', 'denom'], dtype='object')
    subreddit_id      from_user         to_user  num  denom
0        t5_22i0      -Calidro-         niedrig    1      1
1        t5_22i0           -KR-          beerde    1      1
2        t5_22i0           -to-   boilersuthere    1      3
3        t5_22i0  0xKaishakunin     Auswaschbar    1      2
4        t5_22i0  0xKaishakunin       EinDenker    1      3
..           ...            ...             ...  ...    ...
135      t5_22i0   Ausrufepunkt          Kashik    1      3
136      t5_22i0   Ausrufepunkt          Kouzai    1      1
137      t5_22i0   Ausrufepunkt        Le_Cooke    1      2
138      t5_22i0   Ausrufepunkt       LittleLui    3      3
139      t5_22i0   Ausrufepunkt  MatzedieFratze    1      1

[140 rows x 5 columns]


Calculate edge weights

In [None]:
data_method2b_final['edgeweight_method2b'] = data_method2b_final['num']/data_method2b_final['denom']
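
Method 2b sums the numerators and denominators before dividing, whereas Method 2a averages ratios that were already computed. The two can differ when a user pair has several rows. The sketch below shows this on hypothetical counts for a single pair.

```python
import pandas as pd

# Hypothetical counts for one user pair across two parent comments.
counts = pd.DataFrame({
    "from_user": ["u1", "u1"],
    "to_user": ["u2", "u2"],
    "num": [1, 1],
    "denom": [1, 3],
})

summed = counts.groupby(["from_user", "to_user"], as_index=False).agg(
    {"num": "sum", "denom": "sum"}
)
ratio_of_sums = (summed["num"] / summed["denom"]).iloc[0]   # 2 / 4
mean_of_ratios = (counts["num"] / counts["denom"]).mean()   # (1/1 + 1/3) / 2

print(ratio_of_sums, mean_of_ratios)
```

The ratio of sums gives more influence to parent comments with many replies, while the mean of ratios treats every row equally.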

In [None]:
print(len(data_method2b_final))
print(data_method2b_final.head(10))
print(data_method2b_final['subreddit_id'].unique()) #15 subreddits

53367
  subreddit_id      from_user             to_user  num  denom  \
0      t5_22i0      -Calidro-             niedrig    1      1   
1      t5_22i0           -KR-              beerde    1      1   
2      t5_22i0           -to-       boilersuthere    1      3   
3      t5_22i0  0xKaishakunin         Auswaschbar    1      2   
4      t5_22i0  0xKaishakunin           EinDenker    1      3   
5      t5_22i0  0xKaishakunin  IdenticalHandTwins    1      2   
6      t5_22i0  0xKaishakunin         Jay_Quellin    1      1   
7      t5_22i0  0xKaishakunin             SirLoki    1      2   
8      t5_22i0  0xKaishakunin            mamo1893    2      3   
9      t5_22i0  0xKaishakunin        omfgwallhax2    1      2   

   edgeweight_method2b  
0             1.000000  
1             1.000000  
2             0.333333  
3             0.500000  
4             0.333333  
5             0.500000  
6             1.000000  
7             0.500000  
8             0.666667  
9             0.500000  
['t

In [None]:
G_method2b = nx.from_pandas_edgelist(data_method2b_final, "from_user", "to_user", edge_attr="edgeweight_method2b", create_using=nx.DiGraph()) #edge weights are stored under 'edgeweight_method2b', not under the default 'weight' key

In [None]:
pip install Node2Vec



In [None]:
from node2vec import Node2Vec as n2v

The code below visualizes the graph and takes a while to run

In [None]:
pos1 = nx.spring_layout(G_method2b, k=1)  # k sets the preferred distance between nodes
nx.draw(G_method2b, pos1, with_labels=True)
labels1 = {e: G_method2b.edges[e]['edgeweight_method2b'] for e in G_method2b.edges}
nx.draw_networkx_edge_labels(G_method2b, pos1, edge_labels=labels1)
plt.show()

In [None]:
print(G_method2b.number_of_nodes()) #no.of nodes
print(G_method2b.number_of_edges()) #number of edges; fewer than the 53367 rows because a DiGraph keeps one edge per ordered user pair
print(np.mean([d for _, d in G_method2b.degree()])) #average degree of nodes
print(G_method2b.size(weight='edgeweight_method2b'))

28264
53326
3.773422020945372
33542.48149549543


The code took about 5 min to run on the V100 GPU

In [None]:
WINDOW = 3 # Node2Vec fit window
MIN_COUNT = 1 # Node2Vec min. count
BATCH_WORDS = 4 # Node2Vec batch words

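g_emb2b = n2v(G_method2b,dimensions=10) # dimensions=10 is overridden by vector_size=128 in fit() below; the embeddings printed later have length 128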

mdl2b = g_emb2b.fit(
    vector_size = 128,
    window=WINDOW,
    min_count=MIN_COUNT,
    batch_words=BATCH_WORDS
)

Computing transition probabilities:   0%|          | 0/28264 [00:00<?, ?it/s]

Generating walks (CPU: 1): 100%|██████████| 10/10 [00:48<00:00,  4.88s/it]


information about generated embeddings

In [None]:
print(len(mdl2b.wv)) #print the length of the embeddings generated = 28264 which is the number of nodes
print(len(mdl2b.wv[0])) #length of each embedding is 128
print(mdl2b.wv[0]) #print the first embedding

28264
128
[ 0.00347254  0.08085492  0.03798684  0.21070755  0.466869    0.06075256
  0.38792282  0.19309138  0.26022932 -0.3163212   0.20014359  0.1344779
  0.01099483  0.30102324 -0.32120386 -0.29216066 -0.02909642 -0.10546302
 -0.32035035 -0.581886    0.55364996  0.13971843 -0.3010873   0.085507
  0.19837342  0.368842    0.02387394 -0.19341174  0.05263124  0.624974
 -0.44761485 -0.16804436  0.01907294 -0.1741592   0.26527414  0.03433206
  0.5453702  -0.2657168  -0.10032845  0.6755702  -0.27845064 -0.4004729
 -0.17062283 -0.1250079  -0.14375368 -0.2001342  -0.0501754  -0.20400839
 -0.14478551  0.4861296  -0.16867335 -0.18280265  0.2278261   0.07849106
  0.7835564   0.10752746  0.15556824  0.11329722 -0.1577164  -0.09949001
 -0.12021227 -0.05560232  0.08119474  0.07605154 -0.05694162 -0.0783243
 -0.44530186 -0.05902959  0.13531138  0.14994289 -0.077526   -0.21439265
 -0.6135226  -0.38301075  0.4527517   0.17502806  0.28964454 -0.43840334
 -0.33461523 -0.20946546  0.1189945  -0.4549514 

print the node embedding for a given node

In [None]:
mdl2b.wv['Madjura']

array([-0.1454237 ,  0.06207783,  0.03120392, -0.00568491,  0.2182988 ,
       -0.15304774,  0.12427642,  0.04810417, -0.01957378, -0.20886025,
       -0.00435185,  0.196402  ,  0.01506579,  0.22539595,  0.04412238,
       -0.01962706, -0.19747691, -0.0253114 , -0.15811306, -0.1767917 ,
        0.09439678,  0.12110768, -0.12005102,  0.12205044, -0.04797857,
        0.05284562, -0.28687862, -0.33101302, -0.03049977,  0.0948332 ,
       -0.09493613,  0.00199447,  0.09233416,  0.11640783,  0.15350369,
       -0.10997519, -0.18523501, -0.2152639 ,  0.06004549,  0.11789213,
       -0.14397484,  0.00817098,  0.00516457,  0.42681918, -0.11585851,
       -0.18422166,  0.00283162, -0.03074341, -0.14102392, -0.10245258,
       -0.02147532, -0.10789565, -0.04791338, -0.04138652,  0.03908789,
       -0.14385746, -0.08379576,  0.03132664, -0.13485712, -0.05709871,
       -0.14143547,  0.27408612,  0.06428565, -0.04644211, -0.06460852,
        0.2298045 , -0.1303745 ,  0.15633416,  0.06146976, -0.02

function to find cosine similarity

In [None]:
import numpy as np
def cosine_sim(vector1, vector2):
    return min(1., np.dot(vector1, vector2) / (np.linalg.norm(vector1, ord=2) * np.linalg.norm(vector2, ord=2)))

In [None]:
x = mdl2b.wv['Madjura']
z = mdl2.wv['Madjura']
y = mdl2b.wv['waboz']
print(cosine_sim(x, y))
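print(cosine_sim(z, x)) #cosine between the method 2a and method 2b embeddings of the same node; the two models are trained independently, so their coordinate spaces are not aligned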

0.7623948
0.22523181


Print the nodes whose embeddings are most similar to a given node

In [None]:
comment_id = 'Madjura'
for s in mdl2b.wv.most_similar(comment_id, topn = 10):
    print(s)

('Drengskapr_', 0.8839974999427795)
('So_we_fo_o_fo', 0.8776150941848755)
('shmoozy', 0.8763810992240906)
('NoxiousPluK', 0.8762131929397583)
('JamaicanBoySmith', 0.8748509287834167)
('sonicman420', 0.873619794845581)
('Andruw25', 0.8732419013977051)
('paulvs88', 0.8705782294273376)
('BrennanBr', 0.8692038655281067)
('kairho', 0.8686172366142273)


Store the similarities computed from the embeddings

Create a new dataframe to store the cosine similarity between the sender node and the receiver node. Because Method 2b works at the subreddit level, the submission id of a user pair is not available, so the dataframe stores the subreddit id instead.

In [None]:
data_method2b_node2vec = pd.DataFrame(columns=['subreddit_id','from_user','to_user','edgeweight_method2b','cosine_similarity'])
print(data_method2b_node2vec)

Empty DataFrame
Columns: [subreddit_id, from_user, to_user, edgeweight_method2b, cosine_similarity]
Index: []


In [None]:
i = 0
print('generating cosine similarity for '+str(len(data_method2b_final))+' user pairs'+'\n')

for ind, row in data_method2b_final.iterrows():

  i += 1
  if i % 1000 == 0:
    print('finished '+ str(i)+'/'+str(len(data_method2b_final))+' user pairs')

  #get the user pair and calculate the cosine similarity between their node embeddings
  from_user = row['from_user']
  to_user = row['to_user']
  from_user_embedding = mdl2b.wv[from_user]
  from_to_embedding = mdl2b.wv[to_user]

  cos_sim_between_from_to_user = cosine_sim(from_user_embedding, from_to_embedding)

  curr_edge_weight = row['edgeweight_method2b']
  curr_subreddit_id = row['subreddit_id']

  data_method2b_node2vec.loc[len(data_method2b_node2vec.index)] = [curr_subreddit_id, from_user, to_user, curr_edge_weight, cos_sim_between_from_to_user]

generating cosine similarity for 53326 user pairs

finished 1000/53326 user pairs
finished 2000/53326 user pairs
finished 3000/53326 user pairs
finished 4000/53326 user pairs
finished 5000/53326 user pairs
finished 6000/53326 user pairs
finished 7000/53326 user pairs
finished 8000/53326 user pairs
finished 9000/53326 user pairs
finished 10000/53326 user pairs
finished 11000/53326 user pairs
finished 12000/53326 user pairs
finished 13000/53326 user pairs
finished 14000/53326 user pairs
finished 15000/53326 user pairs
finished 16000/53326 user pairs
finished 17000/53326 user pairs
finished 18000/53326 user pairs
finished 19000/53326 user pairs
finished 20000/53326 user pairs
finished 21000/53326 user pairs
finished 22000/53326 user pairs
finished 23000/53326 user pairs
finished 24000/53326 user pairs
finished 25000/53326 user pairs
finished 26000/53326 user pairs
finished 27000/53326 user pairs
finished 28000/53326 user pairs
finished 29000/53326 user pairs
finished 30000/53326 user pair
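
The loop above appends one row at a time with `.loc`, which is slow for 53,326 pairs. A minimal vectorized sketch of the same calculation follows. It is an illustration, not the notebook's code: the dict `toy_wv` and the dataframe `toy_pairs` are hypothetical stand-ins for `mdl2b.wv` and `data_method2b_final`, and `pairwise_cosine` is a name introduced here.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for mdl2b.wv and data_method2b_final
toy_wv = {
    "alice": np.array([1.0, 0.0]),
    "bob": np.array([0.0, 2.0]),
    "carol": np.array([3.0, 0.0]),
}
toy_pairs = pd.DataFrame({
    "from_user": ["alice", "alice", "bob"],
    "to_user": ["bob", "carol", "bob"],
})

def pairwise_cosine(pairs, wv):
    # Stack sender and receiver embeddings into two (n_pairs, dim) matrices
    a = np.vstack([wv[u] for u in pairs["from_user"]])
    b = np.vstack([wv[u] for u in pairs["to_user"]])
    # Row-wise dot product divided by the product of the row norms
    sims = (a * b).sum(axis=1) / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))
    # Clip at 1.0, as cosine_sim in this notebook does
    return np.minimum(1.0, sims)

toy_result = toy_pairs.assign(cosine_similarity=pairwise_cosine(toy_pairs, toy_wv))
print(toy_result)
```

With the real data, the embeddings would be looked up from the trained model instead of the toy dict; the other columns of the edge list carry over unchanged because `assign` only adds a column.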

In [None]:
print(len(data_method2b_node2vec))
print(data_method2b_node2vec.columns)
print(data_method2b_node2vec.head(3))
print(data_method2b_node2vec['subreddit_id'].unique())

53326
Index(['subreddit_id', 'from_user', 'to_user', 'edgeweight_method2b',
       'cosine_similarity'],
      dtype='object')
  subreddit_id  from_user        to_user  edgeweight_method2b  \
0      t5_22i0  -Calidro-        niedrig             1.000000   
1      t5_22i0       -KR-         beerde             1.000000   
2      t5_22i0       -to-  boilersuthere             0.333333   

   cosine_similarity  
0           0.980658  
1           0.939030  
2           0.978517  
['t5_22i0' 't5_2qh33' 't5_2qhwp' 't5_2qm35' 't5_2qmpb' 't5_2qo4s'
 't5_2r2jt' 't5_2r4oc' 't5_2ror6' 't5_2scss' 't5_2sgoq' 't5_2sjgc'
 't5_2vbli' 't5_2wm0g' 't5_3deqz']


In [None]:
data_method2b_node2vec.to_csv('/content/gdrive/MyDrive/Colab Notebooks/reddit_project/data_fifteen_subreddits_method2b_node2vec.csv',index=False)

---
# **Part 4: Generate the Graph using Method 2c**<br>

---


The unit level is the entire Reddit dataset: all 15 subreddits are treated as one pool of comments.

Create an empty dataframe to store the numerator and denominator of the edge weights generated using method 2c

In [None]:
data_method2c = pd.DataFrame(columns=['subreddit_id','submission_id','from_user','to_user','num', 'denom'])
print(data_method2c)

Empty DataFrame
Columns: [subreddit_id, submission_id, from_user, to_user, num, denom]
Index: []


In [None]:
def method2c_function(subreddit_id, subreddit_data):
  submissions_list = subreddit_data['link_id'].unique()

  if subreddit_id != "":
    print("Consider subreddit with ID: ",subreddit_id)
  print("total number of submission: ",len(submissions_list))
  counter = 0
  ignore_comments_counter =  0

  #remember the type of a valid parent_id so that rows whose parent_id has a different type (e.g. missing values) can be skipped below
  type_base = type(subreddit_data['parent_id'].iloc[0])

  #number of comments in the data passed in (for method 2c, all 15 subreddits together)
  tot_comments = len(subreddit_data)
  print("total number of comments in this submission = ", tot_comments)

  i = 0

  #iterate over every comment in the data passed in
  for index, row in subreddit_data.iterrows():

      i += 1
      if i % 10000 == 0:
        print('comment '+str(i)+'/'+str(len(subreddit_data)))

      curr_link_id = row['link_id']
      curr_author = row['author']
      if type(row['parent_id']) != type_base:
        continue
      curr_parent_comment_id = row['parent_id'][3:] #noticed that the parent id is nothing but the comment id preceded by 3 characters


      if(len(subreddit_data[subreddit_data['id'] == curr_parent_comment_id]['author']) == 0): #the parent comment could not be found
        ignore_comments_counter += 1
        continue

      #print('Found a valid parent comment in the submission')
      curr_parent = subreddit_data[subreddit_data['id'] == curr_parent_comment_id]['author'].values[0]

      all_replies_to_parent_df = subreddit_data[(subreddit_data["parent_id"] == row['parent_id'])]
      if len(all_replies_to_parent_df) == 0:
        ignore_comments_counter += 1
        continue
      curr_author_all_replies_to_parent_df = all_replies_to_parent_df[(all_replies_to_parent_df["author"] == curr_author)]
      if len(curr_author_all_replies_to_parent_df) == 0:
        ignore_comments_counter += 1
        continue
      else:
        if len(data_method2c[(data_method2c['from_user'] == curr_author)&(data_method2c['to_user'] == curr_parent)].values) > 0: #a row for this curr_author -> curr_parent pair has already been recorded (for method 2c the check spans the whole dataset, not a single submission)
          ignore_comments_counter += 1
          continue
        else:
          subreddit_id = row['subreddit_id']
          data_method2c.loc[len(data_method2c.index)] = [subreddit_id, curr_link_id, curr_author, curr_parent,len(curr_author_all_replies_to_parent_df), len(all_replies_to_parent_df) ]

  print('total number of comments ignored: ' +str(ignore_comments_counter))
  return data_method2c

data_method2c = method2c_function("",data_fifteen_subreddits)


total number of submission:  6156
total number of comments in this submission =  107352
comment 10000/107352
comment 20000/107352
comment 30000/107352
comment 40000/107352
comment 50000/107352
comment 60000/107352
comment 70000/107352
comment 80000/107352
comment 90000/107352
comment 100000/107352
total number of comments ignored: 54021
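
The function above filters the full dataframe several times for every comment, which is why it takes a while on 107,352 comments. The sketch below shows how the same numerator and denominator could be computed with a merge and two group-bys. It is an assumption-laden illustration rather than a verified replacement: `edge_counts` and `toy_comments` are hypothetical names, comment ids are assumed unique, and keeping the first row per (from_user, to_user) pair is intended to mirror the duplicate check in the loop.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the notebook's comments
toy_comments = pd.DataFrame({
    "id": ["c1", "c2", "c3", "c4", "c5"],
    "parent_id": ["t3_s1", "t1_c1", "t1_c1", "t1_c1", "t1_c2"],
    "author": ["A", "B", "C", "B", "A"],
})

def edge_counts(comments):
    # Drop the 3-character prefix of parent_id to get the parent comment id
    replies = comments.assign(parent_comment_id=comments["parent_id"].str[3:])
    parents = comments[["id", "author"]].rename(
        columns={"id": "parent_comment_id", "author": "to_user"}
    )
    # The inner merge drops replies whose parent comment is not in the data
    replies = replies.merge(parents, on="parent_comment_id", how="inner")
    replies = replies.rename(columns={"author": "from_user"})
    # denom: all replies to the same parent comment
    replies["denom"] = replies.groupby("parent_comment_id")["id"].transform("count")
    # num: replies by this author to that parent comment
    replies["num"] = replies.groupby(["parent_comment_id", "from_user"])["id"].transform("count")
    # Keep the first row per user pair, as the loop's duplicate check does
    out = replies.drop_duplicates(subset=["from_user", "to_user"])
    return out[["from_user", "to_user", "num", "denom"]].reset_index(drop=True)

toy_edges = edge_counts(toy_comments)
toy_edges["edgeweight"] = toy_edges["num"] / toy_edges["denom"]
print(toy_edges)
```

In the toy data, B replies twice and C once to A's comment `c1`, so the pair B to A gets num 2 and denom 3, and C to A gets num 1 and denom 3.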


In [None]:
data_method2c_final = pd.DataFrame(columns=['from_user','to_user','num','denom'])
print(data_method2c_final)

Empty DataFrame
Columns: [from_user, to_user, num, denom]
Index: []


Create a new dataframe to store the aggregated user pairs across all comments on Reddit

In [None]:
data_method2c_final = data_method2c.groupby(['from_user', 'to_user'], as_index=False).agg({'num':'sum','denom':'sum'})

In [None]:
print(len(data_method2c_final))
print(data_method2c_final.columns)
print(data_method2c_final.head(140))

53326
Index(['from_user', 'to_user', 'num', 'denom'], dtype='object')
           from_user         to_user  num  denom
0             --AJ--          --AJ--    1      1
1             --AJ--      Switch72nd    1      4
2            --Nylon  tastefulchrist    1      1
3              -9879       Personzoo    1      7
4    -AllInTheGameYo        Bitcoin0    1      1
..               ...             ...  ...    ...
135       0XSavageX0       [deleted]    1      5
136            0_0_0            Dkeh    1      2
137            0_0_0    TheXenophobe    1      1
138            0_0_0  ToastedCupcake    1      1
139             0asq         Seldain    1      3

[140 rows x 4 columns]


In [None]:
data_method2c_final['edgeweight_method2c'] = data_method2c_final['num']/data_method2c_final['denom']

In [None]:
print(len(data_method2c_final))
print(data_method2c_final.head(10))

53326
         from_user            to_user  num  denom  edgeweight_method2c
0           --AJ--             --AJ--    1      1             1.000000
1           --AJ--         Switch72nd    1      4             0.250000
2          --Nylon     tastefulchrist    1      1             1.000000
3            -9879          Personzoo    1      7             0.142857
4  -AllInTheGameYo           Bitcoin0    1      1             1.000000
5  -AllInTheGameYo          YaBoiWhit    1      2             0.500000
6        -Bacchus-       asstasticbum    1      2             0.500000
7          -BruXy-         xtirpation    1      2             0.500000
8   -Bush_Did_911-           Thor4269    1      2             0.500000
9   -Bush_Did_911-  killingALLTHETIME    1      3             0.333333


In [None]:
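G_method2c = nx.from_pandas_edgelist(data_method2c_final, "from_user", "to_user", edge_attr="edgeweight_method2c", create_using=nx.DiGraph()) #the edge weight is stored under the attribute 'edgeweight_method2c', not 'weight', so algorithms that look up 'weight' by default will not see it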
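
The graph above stores each edge weight under the attribute name `edgeweight_method2c`. To my knowledge the node2vec package reads weights from the attribute named by its `weight_key` argument, which defaults to `'weight'`, so the random walks would treat every edge as equally weighted unless that argument is set or the attribute is renamed. A minimal sketch of the renaming route, using a hypothetical `toy_edges` dataframe and `G_toy` graph:

```python
import networkx as nx
import pandas as pd

# Hypothetical toy edge list with the notebook's column names
toy_edges = pd.DataFrame({
    "from_user": ["A", "B"],
    "to_user": ["B", "C"],
    "edgeweight_method2c": [0.25, 1.0],
})

# Rename the column so the edge attribute is stored under the key "weight"
G_toy = nx.from_pandas_edgelist(
    toy_edges.rename(columns={"edgeweight_method2c": "weight"}),
    "from_user",
    "to_user",
    edge_attr="weight",
    create_using=nx.DiGraph(),
)

print(G_toy.edges(data=True))
print(G_toy.size(weight="weight"))  # sum of the edge weights
```

Whether weighted walks are wanted here is a modelling decision; the sketch only shows where the weight would have to live for it to be picked up by default.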

In [None]:
%pip install node2vec



In [None]:
from node2vec import Node2Vec as n2v

The code to visualize the graph takes a while to run (the graph has 28,264 nodes and 53,326 edges)

In [None]:
pos2 = nx.spring_layout(G_method2c, k=1)  # k sets the preferred distance between nodes in the layout
nx.draw(G_method2c, pos2, with_labels=True)
labels2 = {e: G_method2c.edges[e]['edgeweight_method2c'] for e in G_method2c.edges}
nx.draw_networkx_edge_labels(G_method2c, pos2, edge_labels=labels2)
plt.show()

In [None]:
print(G_method2c.number_of_nodes()) #no.of nodes
print(G_method2c.number_of_edges()) #edges same as number of rows
print(np.mean([d for _, d in G_method2c.degree()])) #average degree of nodes
print(G_method2c.size(weight='edgeweight_method2c'))

28264
53326
3.773422020945372
33542.48149549529
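
The average degree printed above can be checked by hand. For a directed graph, `degree()` counts incoming plus outgoing edges, so every edge contributes 2 to the total and the mean is 2 × edges / nodes. The small graph `G_toy` below is a hypothetical example used only to confirm the identity.

```python
import networkx as nx
import numpy as np

# Hypothetical toy graph: 3 nodes, 3 directed edges
G_toy = nx.DiGraph([("A", "B"), ("B", "A"), ("A", "C")])

# Same expression as in the notebook
avg_degree_toy = float(np.mean([d for _, d in G_toy.degree()]))
# Closed form: each directed edge adds one out-degree and one in-degree
formula_toy = 2 * G_toy.number_of_edges() / G_toy.number_of_nodes()

# The same formula applied to the node and edge counts printed above
notebook_value = 2 * 53326 / 28264

print(avg_degree_toy, formula_toy, notebook_value)
```

Applied to 53,326 edges and 28,264 nodes, the formula gives about 3.7734, which agrees with the value printed by the notebook.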


In [None]:
WINDOW = 3 # Node2Vec fit window
MIN_COUNT = 1 # Node2Vec min. count
BATCH_WORDS = 4 # Node2Vec batch words

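#note: dimensions=10 here does not determine the final size; vector_size=128 passed to fit() below takes precedence (the embeddings printed further down have length 128)
g_emb2bc = n2v(G_method2c,dimensions=10)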

mdl2bc = g_emb2bc.fit(
    vector_size = 128,
    window=WINDOW,
    min_count=MIN_COUNT,
    batch_words=BATCH_WORDS
)

Computing transition probabilities:   0%|          | 0/28264 [00:00<?, ?it/s]

Generating walks (CPU: 1): 100%|██████████| 10/10 [00:48<00:00,  4.89s/it]


Information about the generated embeddings

In [None]:
print(len(mdl2bc.wv)) #print the length of the embeddings generated = 28264 which is the number of nodes
print(len(mdl2bc.wv[0])) #length of each embedding is 128
print(mdl2bc.wv[0]) #print the first embedding

28264
128
[-0.04756569 -0.05600286  0.08319709  0.66192573 -0.4762074  -0.03954616
 -0.08874775 -0.12673831 -0.42397037  0.30032375 -0.35188013  0.42629907
  0.31019852 -0.14636073  0.2008281  -0.05734066 -0.16071454 -0.28997338
  0.29411852  0.10442841 -0.35429472 -0.1360251  -0.14148103 -0.08895511
  0.12226481 -0.40925172 -0.18205109 -0.42469633  0.17450495  0.1867594
 -0.5572938  -0.22337781  0.21859688  0.2807235   0.12250751 -0.189194
 -0.07265943 -0.23120818  0.05981892 -0.0479635  -0.00469501 -0.28261957
 -0.01660208 -0.3928529   0.38990456 -0.06137879  0.20603807  0.05449421
  0.5568724  -0.35518476 -0.23243506  0.48411533 -0.18260464 -0.28640264
  0.33352974 -0.3169716   0.4409593   0.10333502  0.01238558  0.53555495
  0.11477724 -0.15146197 -0.18843716 -0.09820396  0.33905905 -0.23137882
 -0.3736293   0.31845626  0.3901527  -0.46818408  0.0418357  -0.3079685
 -0.25002787 -0.23046768 -0.00895438 -0.12018925 -0.394856   -0.23954412
  0.4932895  -0.4456054   0.34654805  0.31802

In [None]:
mdl2bc.wv['Madjura']

array([-0.19434017, -0.0531183 , -0.01629902,  0.34204966,  0.04003851,
        0.07169922,  0.02856452, -0.01241161, -0.16449308,  0.18830442,
        0.04451621,  0.05653746,  0.15632635, -0.0278953 ,  0.03989497,
        0.08859412,  0.03278289, -0.19680907,  0.04118812,  0.0128501 ,
       -0.27227205,  0.27190733, -0.06811446, -0.05719217,  0.02986479,
       -0.14604235, -0.05329654, -0.04836642, -0.06584596,  0.15655895,
       -0.09143082, -0.04527028,  0.04880097, -0.03330615,  0.150073  ,
       -0.3390999 ,  0.14553687, -0.14513104, -0.02179531,  0.12446674,
       -0.27733195, -0.3126226 ,  0.10631138, -0.1721232 , -0.10665885,
        0.05659162,  0.03964185, -0.15667503,  0.13860117, -0.08700717,
        0.18424836,  0.0372806 ,  0.0876261 , -0.27111998,  0.13576278,
       -0.10534384,  0.05218143,  0.17975566,  0.0069162 ,  0.00078402,
        0.01080272, -0.07021152, -0.17858607, -0.04562951,  0.0173197 ,
       -0.10818859, -0.0851758 ,  0.05082953,  0.09084512, -0.29

Function to find cosine similarity

In [None]:
import numpy as np
def cosine_sim(vector1, vector2):
    return min(1., np.dot(vector1, vector2) / (np.linalg.norm(vector1, ord=2) * np.linalg.norm(vector2, ord=2)))
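
A few hand-checkable inputs make the behaviour of `cosine_sim` concrete. The function is repeated so the example runs on its own; the variable names `parallel`, `orthogonal` and `opposite` are introduced here for illustration.

```python
import numpy as np

def cosine_sim(vector1, vector2):
    return min(1., np.dot(vector1, vector2) / (np.linalg.norm(vector1, ord=2) * np.linalg.norm(vector2, ord=2)))

# Same direction, different length: similarity 1 (up to rounding)
parallel = cosine_sim(np.array([1.0, 2.0]), np.array([2.0, 4.0]))
# Perpendicular vectors: similarity 0
orthogonal = cosine_sim(np.array([1.0, 0.0]), np.array([0.0, 3.0]))
# Opposite direction: similarity -1
opposite = cosine_sim(np.array([1.0, 0.0]), np.array([-2.0, 0.0]))

print(parallel, orthogonal, opposite)
```

The `min(1., ...)` guards against floating-point rounding pushing a value slightly above 1. The lower bound is not clipped, and a zero vector would lead to a division by zero, so inputs are assumed to be non-zero.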

In [None]:
w = mdl2.wv['Madjura']
x = mdl2b.wv['Madjura']
z = mdl2bc.wv['Madjura']
y = mdl2bc.wv['reallynowokaywhat']
print(cosine_sim(z, y))
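print(cosine_sim(w, x)) #cosine similarity between the method 2a and method 2b embeddings of the same user
print(cosine_sim(w, z)) #cosine similarity between the method 2a and method 2c embeddings of the same user
print(cosine_sim(x, z)) #cosine similarity between the method 2b and method 2c embeddings of the same user; in the output below this is the lowest of the three values, and because each model is trained separately its embedding space is not aligned with the others, so these cross-model values should be read with caution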

0.8441807
0.22523181
0.20157982
0.07203801
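
One reason to be careful with the cross-model numbers above: as far as I understand the skip-gram objective, it depends only on dot products between vectors, so rotating every vector of a model by the same orthogonal matrix gives an equally valid embedding. The sketch below illustrates this with random vectors, not with the notebook's models; `space_a`, `space_b` and `cos` are hypothetical names.

```python
import numpy as np

rng = np.random.default_rng(0)

# Three hypothetical node embeddings in an 8-dimensional space
space_a = rng.normal(size=(3, 8))

# A random orthogonal matrix from a QR decomposition
q, _ = np.linalg.qr(rng.normal(size=(8, 8)))

# The same embeddings expressed in a rotated coordinate system
space_b = space_a @ q

def cos(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

within_a = cos(space_a[0], space_a[1])  # node 0 vs node 1 in space A
within_b = cos(space_b[0], space_b[1])  # node 0 vs node 1 in space B
across = cos(space_a[0], space_b[0])    # node 0 in A vs the same node in B

print(within_a, within_b, across)
```

Similarities within one space survive the rotation, while the similarity of a node with its own rotated copy is essentially arbitrary. Comparisons between users inside a single model, such as the `most_similar` lists, are therefore the more reliable quantity.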


In [None]:
comment_id = 'Madjura'
for s in mdl2bc.wv.most_similar(comment_id, topn = 10):
    print(s)

('reallynowokaywhat', 0.844180703163147)
('porzeegod', 0.8419408798217773)
('ph147', 0.8374040126800537)
('Qwertywalkers23', 0.837172269821167)
('ItsTyrrellYo', 0.8370135426521301)
('NachoQueen_', 0.8361842632293701)
('shmoozy', 0.8336062431335449)
('NeonCheese1', 0.8322792053222656)
('mansionsong', 0.8291450142860413)
('DrunkenSexcupcakes', 0.8281821012496948)


Store the generated embeddings

Create a new dataframe to store the cosine similarity between the sender node and the receiver node. Because method 2c aggregates over the entire dataset, a user pair is tied to neither a single submission nor a single subreddit, so neither id is included in the dataframe.

In [None]:
data_method2c_node2vec = pd.DataFrame(columns=['from_user','to_user','edgeweight_method2c','cosine_similarity'])
print(data_method2c_node2vec)

Empty DataFrame
Columns: [from_user, to_user, edgeweight_method2c, cosine_similarity]
Index: []


In [None]:
i = 0
print('generating cosine similarity for '+str(len(data_method2c_final))+' user pairs'+'\n')

for ind, row in data_method2c_final.iterrows():

  i += 1
  if i % 1000 == 0:
    print('finished '+ str(i)+'/'+str(len(data_method2c_final))+' user pairs')

  #get the user pair and calculate the cosine similarity between their node embeddings
  from_user = row['from_user']
  to_user = row['to_user']
  from_user_embedding = mdl2bc.wv[from_user]
  from_to_embedding = mdl2bc.wv[to_user]

  cos_sim_between_from_to_user = cosine_sim(from_user_embedding, from_to_embedding)

  curr_edge_weight = row['edgeweight_method2c']

  data_method2c_node2vec.loc[len(data_method2c_node2vec.index)] = [from_user, to_user, curr_edge_weight, cos_sim_between_from_to_user]

generating cosine similarity for 53326 user pairs

finished 1000/53326 user pairs
finished 2000/53326 user pairs
finished 3000/53326 user pairs
finished 4000/53326 user pairs
finished 5000/53326 user pairs
finished 6000/53326 user pairs
finished 7000/53326 user pairs
finished 8000/53326 user pairs
finished 9000/53326 user pairs
finished 10000/53326 user pairs
finished 11000/53326 user pairs
finished 12000/53326 user pairs
finished 13000/53326 user pairs
finished 14000/53326 user pairs
finished 15000/53326 user pairs
finished 16000/53326 user pairs
finished 17000/53326 user pairs
finished 18000/53326 user pairs
finished 19000/53326 user pairs
finished 20000/53326 user pairs
finished 21000/53326 user pairs
finished 22000/53326 user pairs
finished 23000/53326 user pairs
finished 24000/53326 user pairs
finished 25000/53326 user pairs
finished 26000/53326 user pairs
finished 27000/53326 user pairs
finished 28000/53326 user pairs
finished 29000/53326 user pairs
finished 30000/53326 user pair

In [None]:
print(len(data_method2c_node2vec))
print(data_method2c_node2vec.columns)
print(data_method2c_node2vec.head(3))

53326
Index(['from_user', 'to_user', 'edgeweight_method2c', 'cosine_similarity'], dtype='object')
  from_user         to_user  edgeweight_method2c  cosine_similarity
0    --AJ--          --AJ--                 1.00           1.000000
1    --AJ--      Switch72nd                 0.25           0.984488
2   --Nylon  tastefulchrist                 1.00           0.911960


In [None]:
data_method2c_node2vec.to_csv('/content/gdrive/MyDrive/Colab Notebooks/reddit_project/data_fifteen_subreddits_method2c_node2vec.csv',index=False)

Check whether the edge lists from methods 2b and 2c differ

In [None]:
from pandas.testing import assert_frame_equal
data_method2b_final_sub = data_method2b_final[['from_user', 'to_user', 'edgeweight_method2b']]
data_method2c_final_sub = data_method2c_final[['from_user', 'to_user', 'edgeweight_method2c']]
df11 = data_method2b_final_sub.sort_values(by=data_method2b_final_sub.columns.tolist()).reset_index(drop=True)
df21 = data_method2c_final_sub.sort_values(by=data_method2c_final_sub.columns.tolist()).reset_index(drop=True)
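print (df11.equals(df21)) #note: the weight columns are named differently (edgeweight_method2b vs edgeweight_method2c) and DataFrame.equals also compares column labels, so this prints False even if the values match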
#assert_frame_equal(data_method2b_final_sub, data_method2c_final_sub, check_like = True)

False
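
`DataFrame.equals` compares column labels as well as values, so two frames whose weight columns have different names are reported as unequal regardless of their contents. A minimal sketch of a comparison that looks only at the values, by giving the weight column a common name first. The frames `toy_2b` and `toy_2c` and the helper `same_edges` are hypothetical and stand in for the notebook's edge lists.

```python
import pandas as pd

# Hypothetical toy edge lists: same edges and weights, different row order
toy_2b = pd.DataFrame({
    "from_user": ["A", "B"],
    "to_user": ["B", "C"],
    "edgeweight_method2b": [0.5, 1.0],
})
toy_2c = pd.DataFrame({
    "from_user": ["B", "A"],
    "to_user": ["C", "B"],
    "edgeweight_method2c": [1.0, 0.5],
})

def same_edges(df_b, df_c):
    # Give the weight column the same name in both frames
    left = df_b.rename(columns={"edgeweight_method2b": "weight"})
    right = df_c.rename(columns={"edgeweight_method2c": "weight"})
    cols = ["from_user", "to_user", "weight"]
    # Sort so that row order does not affect the comparison
    left = left[cols].sort_values(cols).reset_index(drop=True)
    right = right[cols].sort_values(cols).reset_index(drop=True)
    return left.equals(right)

# Identical values, only the column label differs
relabeled_equal = toy_2b.equals(
    toy_2b.rename(columns={"edgeweight_method2b": "edgeweight_method2c"})
)
values_equal = same_edges(toy_2b, toy_2c)

print(relabeled_equal, values_equal)
```

The comparison uses exact floating-point equality; if the two weight columns were computed along different paths, a tolerance-based check may be more appropriate.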
