# **Creating the NetworkX graph for askscience subreddit, Generating the Node Embeddings, and Calculating the Network Similarity**

This notebook considers the askscience subreddit and generates the NetworkX graphs.

The interaction between user 'i' and user 'j' is captured by the following metric:

*(Number of comments between i and j)/(Number of comments sent by all users to j)*
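
As a minimal sketch of this metric, assuming a toy reply log with made-up users (the column names `from_user`, `to_user`, `num`, and `denom` mirror the ones used later in this notebook; the data are purely illustrative):

```python
import pandas as pd

# Toy reply log (hypothetical users): each row is one comment sent from
# `from_user` to `to_user`.
replies = pd.DataFrame({
    "from_user": ["a", "a", "b", "c", "a"],
    "to_user":   ["j", "j", "j", "j", "k"],
})

# Numerator: number of comments from i to j.
num = replies.groupby(["from_user", "to_user"]).size().rename("num").reset_index()

# Denominator: number of comments sent by all users to j.
denom = replies.groupby("to_user").size().rename("denom").reset_index()

# Edge weight = num / denom for every (i, j) pair.
weights = num.merge(denom, on="to_user")
weights["edge_weight"] = weights["num"] / weights["denom"]
print(weights)
```

Here user 'a' sent 2 of the 4 comments received by 'j', so the weight of the edge a -> j is 0.5.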

---
note: if you have already run 'Submissions_Processing_askscience.ipynb' to combine the comments and submissions of the askscience subreddit, skip the processing steps of Part 1 and go directly to its end, where the already processed file is read. If you only want to use the comments data, run Part 1, which reads and processes the comments.

**Part 1: Reading the data**<br>
Reads comments data of askscience.<br>
**Part 2: Generate the Graph using Method 2c**<br>
  The unit level is the entire reddit data<br>
  Output file generated: 'data_askscience_subreddits_method2c_node2vec.csv'<br>

### note: The column 'cosine_similarity' in all output files is actually the 'Network Similarity'
---
OUTPUT_FILES:<br>
1. 'data_askscience_subreddits_method2c_node2vec.csv': contains the unique user pairs and their network similarity (cosine similarity between node embeddings).


---
# **Part 1: Reading the data**

This section reads the askscience subreddit data, preprocesses it, and converts it to a dataframe. The preprocessing removes every comment whose author or body is '[deleted]' or '[removed]'.

---

Check whether CUDA is available

In [23]:
import torch
if torch.cuda.is_available():
    device_name = torch.device("cuda")
else:
    device_name = torch.device('cpu')
print("Using {}.".format(device_name))

Using cuda.


Connect to drive

In [24]:
from google.colab import drive
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


note: if you have already run 'Submissions_Processing_askscience.ipynb' to combine the comments and submissions of the askscience subreddit, skip from here to the cell near the end of Part 1 that reads the already processed file 'data_merged_askscience.csv'. If you only want to use the comments data, run the cells below, which read and process the comments.

Read the askscience comments data

In [None]:
import pandas as pd
data_df = pd.read_json('/content/gdrive/MyDrive/Colab Notebooks/askscience/askscience_comments_10_2022.ndjson')

ValueError: ignored

This throws an error, which means that the file has to be processed before pandas can read it.

The first step is to convert the file into valid JSON. To do this, enclose the entire file contents in square brackets and separate the rows (each enclosed in curly braces) with commas. First, we parse through the entire file and add a comma after every row.

In [None]:
# read all rows of the raw ndjson file
with open("/content/gdrive/MyDrive/Colab Notebooks/askscience/askscience_comments_10_2022.ndjson","r") as b:
  bb = b.readlines()

# append a comma after every row and write the rows to a new file
with open('/content/gdrive/MyDrive/Colab Notebooks/askscience/askcience_processed.ndjson','a') as fle:
  for x in bb:
    g = x+','
    print(g)
    fle.write(g)

Streaming output truncated to the last 5000 lines.
{"id":"iu0xdnj","subreddit":"askscience","body":"Hey I'm the same. Back to ED after not being in a clinical position for a long time, only this time I'm in a peds ED and it's really bad. RSV all day every day. We have so many kids that they don't even have private rooms for all of them so they go in shared open bays. Long wait times. Running out of high flow machines.","author":"evdczar","score":6,"gilded":0,"created_utc":1666897379,"parent_id":"iu0sbxi","link_id":"yere7r","retrieved_on":1667876590,"controversiality":0,"is_submitter":false}
,
{"id":"iu0xig7","subreddit":"askscience","body":"[removed]","author":"[deleted]","score":1,"gilded":0,"created_utc":1666897432,"parent_id":"yere7r","link_id":"yere7r","retrieved_on":1667876584,"controversiality":0,"is_submitter":false}
,
{"id":"iubg3ey","subreddit":"askscience","body":"Keep in mind that doctors are still human, and humans are susceptible to biases, misinformation, an

The file now has to be edited, because there is an extra comma after the last row. Download the file from Drive, open it, remove the extra comma, add an opening square bracket at the beginning of the file and a closing one at the end, and upload it back to Drive.
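
As a hedged alternative to the manual editing described above: pandas can parse newline-delimited JSON directly when `read_json` is called with `lines=True`. The sketch below uses two made-up rows held in memory instead of the real file; whether every row of the real file parses cleanly this way has not been checked here.

```python
import io
import pandas as pd

# Two made-up rows in newline-delimited JSON (one JSON object per line).
ndjson_text = (
    '{"id": "c1", "author": "user_a", "body": "first comment"}\n'
    '{"id": "c2", "author": "user_b", "body": "second comment"}\n'
)

# lines=True tells pandas to parse each line as a separate JSON object,
# so no commas or enclosing square brackets are needed.
toy_df = pd.read_json(io.StringIO(ndjson_text), lines=True)
print(toy_df)
```

For the real data, the same call would take the path of the original '.ndjson' file in place of the `io.StringIO` object.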

Now the file 'askcience_processed.ndjson' has to be cleaned of escape sequences and apostrophes.

In [25]:
with open('/content/gdrive/MyDrive/Colab Notebooks/askscience/askcience_processed.ndjson','r') as fle1:
  cur_line = fle1.readlines()

output= ""
count = 0
for line in cur_line:
  line = line.replace("\\n", "" )  # drop escaped newlines inside comment bodies
  line = line.replace('\\"', "" )  # drop escaped double quotes
  line = line.replace('\\', "" )   # drop any remaining backslashes
  line = line.replace("'", "" )    # drop apostrophes
  # line = line.replace('"', "" )
  if count==4:
    print(line)  # print one sample row as a sanity check
  output+=line
  count+=1

# output is a single string, so write it in one call
with open('/content/gdrive/MyDrive/Colab Notebooks/askscience/askcience_processed_1.ndjson','w') as f2:
  f2.write(output)

,{"id":"iqkfmj9","subreddit":"askscience","body":"It *absolutely* implies an expectation, even if it doesnt imply an inevitability. If you said He hasnt arrived yet, that means the person is *supposed* to arrive even if they turn out not to. You could stick with we dont know, but I say it just is because when it comes to things like fundamental properties of nature, I dont think the *causes* of such things are objects of science. If you think they are potential objects of science, Im curious about your perspective.How do you think youre going to determine *why* the speed of light is what is it, beyond *what* it is? Thats a question that invokes the cause of the universe itself. What kind of observation that is possible from within the universe could even *conceivably* answer that question? As far as science is concerned, reality is a closed system.","author":"chop1n","score":3,"gilded":0,"created_utc":1664583378,"parent_id":"iqker6l","link_id":"xs73nx","retrieved_on":1664960507,"contro

In [26]:
import pandas as pd
data = pd.read_json('/content/gdrive/MyDrive/Colab Notebooks/askscience/askcience_processed_1.ndjson',encoding_errors='ignore')

In [27]:
data.head(10)

Unnamed: 0,id,subreddit,body,author,score,gilded,created_utc,parent_id,link_id,retrieved_on,controversiality,is_submitter
0,iqker6l,askscience,No it does not imply that. “We don’t yet know”...,omniskeptic,2,0,1664582942,iqkee0k,xs73nx,1664960533,0,False
1,iqkewq0,askscience,while insect muscle might be similar to ours s...,regular_modern_girl,452,0,1664583016,iqjssf5,xs9pjy,1664960528,0,False
2,iqkfdmz,askscience,[removed],[deleted],1,0,1664583252,iqkb49u,xs9pjy,1664960514,0,False
3,iqkfl8j,askscience,Pasteurization works by heating (generally a l...,jeweledjuniper,11,0,1664583360,iqke0xc,xs1k1y,1664960508,0,False
4,iqkfmj9,askscience,"It *absolutely* implies an expectation, even i...",chop1n,3,0,1664583378,iqker6l,xs73nx,1664960507,0,False
5,iqkfrm5,askscience,"PhD in yeast genetics here, so I’ve streaked t...",smallwhitedog,4,0,1664583450,xs1k1y,xs1k1y,1664960502,0,False
6,iqkfsgy,askscience,Others have given great reasons for why our si...,moewind420,6,0,1664583462,xs73nx,xs73nx,1664960501,0,False
7,iqkft4v,askscience,[removed],[deleted],1,0,1664583472,xs1k1y,xs1k1y,1664960501,0,False
8,iqkfzvn,askscience,Inside a living human isn’t lightless dark. Li...,sovietamerican,7,0,1664583564,iqk1nsq,xs4rhf,1664960495,0,False
9,iqkg3r0,askscience,Wordy is good. your explanation is helping me...,tonytoews,2,0,1664583615,iqk8u6o,xs73nx,1664960492,0,False


Store this dataframe in a CSV so that next time the code can be run from below by directly reading the processed CSV.

In [None]:
data.to_csv("/content/gdrive/MyDrive/Colab Notebooks/askscience/data_askscience.csv", index=False)

Start from here if you are re-running the notebook and the processed CSV already exists.

note: some of the column names are different

In [28]:
import pandas as pd
data_askscience = pd.read_csv('/content/gdrive/MyDrive/Colab Notebooks/askscience/data_askscience.csv', low_memory=False)
print(len(data_askscience)) #length of data = 26605
print(len(pd.unique(data_askscience['subreddit']))) #number of subreddits considered = 1
print(len(pd.unique(data_askscience['id']))) #unique number of comments = 26605 #the data is at the comment level
print(len(pd.unique(data_askscience['parent_id']))) #number of parent nodes = 10538
print(len(pd.unique(data_askscience['link_id']))) #number of submissions = 3004
print(len(pd.unique(data_askscience['author']))) #number of unique authors = 6629
print(len(data_askscience.columns)) #number of columns = 12

26605
1
26605
10538
3004
6629
12


In [None]:
data_askscience.head(10)

Unnamed: 0,id,subreddit,body,author,score,gilded,created_utc,parent_id,link_id,retrieved_on,controversiality,is_submitter
0,iqker6l,askscience,No it does not imply that. “We don’t yet know”...,omniskeptic,2,0,1664582942,iqkee0k,xs73nx,1664960533,0,False
1,iqkewq0,askscience,while insect muscle might be similar to ours s...,regular_modern_girl,452,0,1664583016,iqjssf5,xs9pjy,1664960528,0,False
2,iqkfdmz,askscience,[removed],[deleted],1,0,1664583252,iqkb49u,xs9pjy,1664960514,0,False
3,iqkfl8j,askscience,Pasteurization works by heating (generally a l...,jeweledjuniper,11,0,1664583360,iqke0xc,xs1k1y,1664960508,0,False
4,iqkfmj9,askscience,"It *absolutely* implies an expectation, even i...",chop1n,3,0,1664583378,iqker6l,xs73nx,1664960507,0,False
5,iqkfrm5,askscience,"PhD in yeast genetics here, so I’ve streaked t...",smallwhitedog,4,0,1664583450,xs1k1y,xs1k1y,1664960502,0,False
6,iqkfsgy,askscience,Others have given great reasons for why our si...,moewind420,6,0,1664583462,xs73nx,xs73nx,1664960501,0,False
7,iqkft4v,askscience,[removed],[deleted],1,0,1664583472,xs1k1y,xs1k1y,1664960501,0,False
8,iqkfzvn,askscience,Inside a living human isn’t lightless dark. Li...,sovietamerican,7,0,1664583564,iqk1nsq,xs4rhf,1664960495,0,False
9,iqkg3r0,askscience,Wordy is good. your explanation is helping me...,tonytoews,2,0,1664583615,iqk8u6o,xs73nx,1664960492,0,False


note: the 'id' column seems to be the comment id, whereas the 'parent_id' column seems to be a link to the parent comment. In this data the parent id is not preceded by the 3-character prefix.

Rows where the body is '[removed]' or the author is '[deleted]', or some combination of the two, need to be removed.

In [None]:
data_askscience.loc[data_askscience.author == '[deleted]', 'author'].count()

13334

In [None]:
data_askscience.loc[data_askscience.author == '[removed]', 'author'].count()

0

In [None]:
data_askscience.loc[data_askscience.body == '[deleted]', 'body'].count()

369

In [None]:
data_askscience.loc[data_askscience.body == '[removed]', 'body'].count()

12926

these need to be removed

In [None]:
data_askscience = data_askscience[data_askscience['body'] != '[removed]']
data_askscience = data_askscience[data_askscience['body'] != '[deleted]']
print(len(data_askscience))
print(len(pd.unique(data_askscience['author'])))
print(data_askscience.head(3))

13310
6628
        id   subreddit                                               body  \
0  iqker6l  askscience  No it does not imply that. “We don’t yet know”...   
1  iqkewq0  askscience  while insect muscle might be similar to ours s...   
3  iqkfl8j  askscience  Pasteurization works by heating (generally a l...   

                author  score  gilded  created_utc parent_id link_id  \
0          omniskeptic      2       0   1664582942   iqkee0k  xs73nx   
1  regular_modern_girl    452       0   1664583016   iqjssf5  xs9pjy   
3       jeweledjuniper     11       0   1664583360   iqke0xc  xs1k1y   

   retrieved_on  controversiality  is_submitter  
0    1664960533                 0         False  
1    1664960528                 0         False  
3    1664960508                 0         False  


In [None]:
data_askscience = data_askscience[data_askscience['author'] != '[removed]']
data_askscience = data_askscience[data_askscience['author'] != '[deleted]']
print(len(data_askscience))
print(len(pd.unique(data_askscience['author'])))
print(data_askscience.head(3))

13270
6627
        id   subreddit                                               body  \
0  iqker6l  askscience  No it does not imply that. “We don’t yet know”...   
1  iqkewq0  askscience  while insect muscle might be similar to ours s...   
3  iqkfl8j  askscience  Pasteurization works by heating (generally a l...   

                author  score  gilded  created_utc parent_id link_id  \
0          omniskeptic      2       0   1664582942   iqkee0k  xs73nx   
1  regular_modern_girl    452       0   1664583016   iqjssf5  xs9pjy   
3       jeweledjuniper     11       0   1664583360   iqke0xc  xs1k1y   

   retrieved_on  controversiality  is_submitter  
0    1664960533                 0         False  
1    1664960528                 0         False  
3    1664960508                 0         False  


confirm that there are no more of the incorrect rows

In [None]:
print(data_askscience.loc[data_askscience.author == '[deleted]', 'author'].count())
print(data_askscience.loc[data_askscience.author == '[removed]', 'author'].count())
print(data_askscience.loc[data_askscience.body == '[deleted]', 'body'].count())
print(data_askscience.loc[data_askscience.body == '[removed]', 'body'].count())

0
0
0
0
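
The four separate filters above can also be written as a single boolean mask. This is a minimal sketch on a made-up dataframe (the name `toy` and its rows are hypothetical), not a replacement for the cells above:

```python
import pandas as pd

# Made-up rows with the same 'author' and 'body' columns as the comments data.
toy = pd.DataFrame({
    "author": ["user_a", "[deleted]", "user_b", "user_c"],
    "body":   ["hello", "[removed]", "[deleted]", "world"],
})

placeholders = ["[deleted]", "[removed]"]

# Keep a row only if neither its author nor its body is a placeholder.
keep = ~(toy["author"].isin(placeholders) | toy["body"].isin(placeholders))
toy_clean = toy[keep]
print(len(toy_clean))
```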


Continue from here if you have already executed 'Submissions_Processing_askscience.ipynb' and have a processed file called 'data_merged_askscience.csv'. <br>

Ignore below if you have been processing comments using the above code in Part 1 and did not run 'Submissions_Processing_askscience.ipynb'.

In [29]:
import pandas as pd
data_askscience = pd.read_csv('/content/gdrive/MyDrive/Colab Notebooks/askscience/data_merged_askscience.csv', low_memory=False)
print(len(data_askscience)) #number of rows = 18949
print(len(pd.unique(data_askscience['subreddit']))) #number of subreddits considered = 1
print(len(pd.unique(data_askscience['id']))) #number of unique ids = 18949 #the data is at the comment/submission level
print(len(pd.unique(data_askscience['parent_id']))) #number of unique parent ids = 7790
print(len(pd.unique(data_askscience['link_id']))) #number of unique link ids = 2867
print(len(pd.unique(data_askscience['author']))) #number of unique authors = 10895
print(len(data_askscience.columns)) #number of columns = 17

18949
1
18949
7790
2867
10895
17


In [32]:
data_askscience.head(3)

Unnamed: 0,id,subreddit,body,author,score,gilded,created_utc,retreived_on,permalink,num_comments,url,self_text,is_self,parent_id,link_id,controversiality,is_submitter
0,xsjqzy,askscience,Why do I poop after a glass or two of beer or ...,depressedchiq,1,0,1664591533,1665426496,/r/askscience/comments/xsjqzy/why_do_i_poop_af...,0,https://www.reddit.com/r/askscience/comments/x...,[removed],True,0,0,novalue,True
1,xsju1p,askscience,If you heated water under immense pressure so ...,zhongliabuse,1,0,1664591793,1665426492,/r/askscience/comments/xsju1p/if_you_heated_wa...,1,https://www.reddit.com/r/askscience/comments/x...,[removed],True,0,0,novalue,True
2,xsjx0l,askscience,What is the evolutionary goal of nose elongati...,_ozeki,1,0,1664592050,1665426489,/r/askscience/comments/xsjx0l/what_is_the_evol...,0,https://www.reddit.com/r/askscience/comments/x...,[removed],True,0,0,novalue,True


In [33]:
data_askscience.tail(3)

Unnamed: 0,id,subreddit,body,author,score,gilded,created_utc,retreived_on,permalink,num_comments,url,self_text,is_self,parent_id,link_id,controversiality,is_submitter
18946,iuk9rbq,askscience,I dont understand why you think a local geolog...,tricksterwolf,5,0,1667259694,1667844629,0,0,0,0,0,iuj9y8f,yiizwf,0,False
18947,iuk9uu7,askscience,&gt;but sometimes bacteria wait until they hav...,tedivm,4,0,1667259740,1667844624,0,0,0,0,0,iujz6c0,yia9a5,0,False
18948,iuk9xwj,askscience,Not quite answering the question but fever is ...,lost_in_antartica,2,0,1667259779,1667844621,0,0,0,0,0,yi3t9o,yi3t9o,0,False


how many rows are comments

In [35]:
data_askscience.loc[data_askscience.is_submitter == False, 'is_submitter'].count()

12918

how many rows are submissions

In [36]:
data_askscience.loc[data_askscience.is_submitter == True, 'is_submitter'].count()

6031

12918 + 6031 = 18949

---
# **Part 2: Generate the Graph using Method 2c**<br>

---


 The unit level is the entire reddit data

Create an empty dataframe to store the edge weights generated using method 2c

In [40]:
#data_method2c = pd.DataFrame(columns=['subreddit_id','submission_id','from_user','to_user','num', 'denom'])
#here we have only one subreddit so we ignore the column 'subreddit_id'
data_method2c = pd.DataFrame(columns=['submission_id','from_user','to_user','num', 'denom'])
print(data_method2c)

Empty DataFrame
Columns: [submission_id, from_user, to_user, num, denom]
Index: []


In [41]:
def method2c_function(subreddit_id, subreddit_data):

  case1 = 0
  no_submissions = 0
  case2 = 0
  case3 = 0
  case4 = 0
  case5 = 0

  submissions_list = subreddit_data['link_id'].unique()

  if subreddit_id != "":
    print("Consider subreddit with ID: ",subreddit_id)
  print("total number of submission: ",len(submissions_list))
  counter = 0
  ignore_comments_counter =  0

  #additional code to resolve an error
  type_base = type(subreddit_data['parent_id'].iloc[0])

  #number of comments on the subreddit
  tot_comments = len(subreddit_data)
  print("total number of comments in this submission = ", tot_comments)

  i = 0

  #iterate across the current submission
  for index, row in subreddit_data.iterrows():

      i += 1
      if i % 10000 == 0:
        print('comment '+str(i)+'/'+str(len(subreddit_data)))

      curr_link_id = row['link_id']
      curr_author = row['author']
      if type(row['parent_id']) != type_base:
        case1 += 1
        continue
      if row['is_submitter'] == True: #the row is a submission
        no_submissions += 1
        continue

      #curr_parent_comment_id = row['parent_id'][3:] #noticed that the parent id is nothing but the comment id preceded by 3 characters
      curr_parent_comment_id = row['parent_id'] #noticed that the parent id has been processed to be the same as a comment id


      if(len(subreddit_data[subreddit_data['id'] == curr_parent_comment_id]['author']) == 0): #the parent comment could not be found
        ignore_comments_counter += 1
        print(row['parent_id'])
        case2 += 1
        continue

      #print('Found a valid parent comment in the submission')
      curr_parent = subreddit_data[subreddit_data['id'] == curr_parent_comment_id]['author'].values[0]

      all_replies_to_parent_df = subreddit_data[(subreddit_data["parent_id"] == row['parent_id'])]
      if len(all_replies_to_parent_df) == 0:
        ignore_comments_counter += 1
        case3 += 1
        continue
      curr_author_all_replies_to_parent_df = all_replies_to_parent_df[(all_replies_to_parent_df["author"] == curr_author)]
      if len(curr_author_all_replies_to_parent_df) == 0:
        ignore_comments_counter += 1
        case4 += 1
        continue
      else:
        if len(data_method2c[(data_method2c['from_user'] == curr_author)&(data_method2c['to_user'] == curr_parent)].values) > 0: #a row for curr_author -> curr_parent already exists (the check does not compare submission ids)
          ignore_comments_counter += 1
          case5 += 1
          continue
        else:
          #subreddit_id = row['subreddit_id'] ignore as there is only 1 subreddit called askscience
          data_method2c.loc[len(data_method2c.index)] = [curr_link_id, curr_author, curr_parent,len(curr_author_all_replies_to_parent_df), len(all_replies_to_parent_df) ]

  print('total number of comments ignored: ' +str(ignore_comments_counter))
  print('case1 = '+str(case1))
  print('case2 = '+str(case2))
  print('case3 = '+str(case3))
  print('case4 = '+str(case4))
  print('case5 = '+str(case5))
  print('no of submissions = '+str(no_submissions))
  return data_method2c

data_method2c = method2c_function("",data_askscience)


total number of submission:  2867
total number of comments in this submission =  18949
iqjssf5
xs1k1y
xs73nx
iqk1nsq
iqk8u6o
iqkaj2g
iqichkb
iqjirl2
iqkaihm
xsgkn7
xsfixm
xsdzls
iqjttey
iqkb5t9
xrpq10
xrr4p8
xs73nx
iqjkd0o
iqjkd0o
iqjliz7
iqjawje
xrbv5x
iqjdcpw
xrqekh
iqkdhbh
iqjssf5
iqkdhbh
iqjssf5
iqjmjt9
iqhp8u5
iqk6f47
xrr4p8
iq9gk4h
xsdkvd
xsdkvd
iqjssf5
iqiyt8v
iqge47b
xsmj1a
iqhpfwt
iqjawje
xs9pjy
xrr4p8
iqiz8cr
iqjawje
xs73nx
iqjawje
xs1k1y
iqjekwj
xsnbyp
iqichkb
iqk096u
iqjc41x
iqk8u6o
xsdkvd
iqk7ay2
xsdkvd
iqjawje
iqjvwhg
iqk7ay2
iqjuh0i
iqk1nsq
iqk1nsq
iqj3au0
iqjbujn
iqj2ykl
xs73nx
iqimw5o
xsqx3s
xstd31
xstk1t
iqk1nsq
iqep2cd
xsu76i
iqkdhbh
xsrdb4
iqjyfco
iqjuh0i
iqk7ay2
xsvlqc
iqk1nsq
iqjssf5
iqjliz7
iqhnjfr
xsxcyv
xs73nx
iqkdhbh
iqjssf5
iqje0d7
iqgram4
iqgww3z
iqhsfa4
iqjprqs
xrh717
iqjybe7
xsyz18
iqhd5zy
xsz6yd
xs9llv
xt1zgc
iqjssf5
xt1t4o
iqjssf5
iqhqnur
xt2fhw
xsdkvd
xt68bi
iqjzaw6
xsz6yd
xt82ev
xt0i9m
iqjssf5
iqjssf5
xtb5ok
xtb2r8
xtavu7
xtdn5h
iqonjiz
iqpp30b
iqppqft
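
The row-by-row loop above rescans the whole dataframe for every comment. As a rough sketch of the same per-parent counts computed with a merge and `groupby`, assuming a toy dataframe with made-up rows: note that this is not an exact reimplementation, because the loop additionally keeps only the first row found for each (from_user, to_user) pair (case5).

```python
import pandas as pd

# Toy stand-in for data_askscience (hypothetical rows).
df = pd.DataFrame({
    "id":           ["s1", "c1", "c2", "c3", "c4"],
    "author":       ["op", "u1", "u2", "u1", "u3"],
    "parent_id":    ["0",  "s1", "s1", "s1", "c9"],  # c9 is not in the data
    "link_id":      ["0",  "s1", "s1", "s1", "s1"],
    "is_submitter": [True, False, False, False, False],
})

comments = df[~df["is_submitter"]].rename(columns={"author": "from_user"})
parents = df[["id", "author"]].rename(columns={"id": "parent_id", "author": "to_user"})

# Inner join: comments whose parent is missing are dropped (case2 in the loop).
edges = comments.merge(parents, on="parent_id", how="inner")

# denom: all replies to a parent; num: replies to it written by from_user.
denom = comments.groupby("parent_id").size().rename("denom")
pair_counts = (
    edges.groupby(["parent_id", "from_user", "to_user"])
    .size().rename("num").reset_index()
    .join(denom, on="parent_id")
)
pair_counts["edge_weight"] = pair_counts["num"] / pair_counts["denom"]
print(pair_counts)
```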

Check that the parent of one of the ignored comments is really absent from the data

In [43]:
data_askscience.loc[data_askscience.id == 'iuj9y8f', 'id'].count()

0

So 1818 comments do not have a valid parent, even after checking the submissions data. Since this is a one-month snapshot, their parents were presumably posted in the previous month.
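
A count like this can be checked without the loop. The sketch below assumes a made-up dataframe with the same 'id', 'parent_id', and 'is_submitter' columns and counts the comments whose parent id does not appear anywhere in the 'id' column:

```python
import pandas as pd

# Made-up rows: one submission (s1), one reply to it, one reply to a missing parent.
toy = pd.DataFrame({
    "id":           ["s1", "c1", "c2"],
    "parent_id":    ["0",  "s1", "zz"],
    "is_submitter": [True, False, False],
})

comments = toy[~toy["is_submitter"]]

# A comment is an orphan if its parent_id matches no id in the data.
orphans = comments[~comments["parent_id"].isin(toy["id"])]
print(len(orphans))
```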

In [47]:
print(len(data_method2c))
data_method2c.head(30)

11100


Unnamed: 0,submission_id,from_user,to_user,num,denom
0,xs73nx,omniskeptic,chop1n,1,1
1,xs1k1y,jeweledjuniper,feitingen,1,1
2,xs73nx,chop1n,omniskeptic,1,1
3,xshtfw,askscience-modteam,bhjoduar,1,1
4,xs4rhf,greese007,blscratch,1,1
5,xs9pjy,viciousfishous08,thelogicalghost,1,1
6,xs73nx,tin_man6328,moewind420,1,1
7,xs73nx,yeswehavenotomatoes,balazer,1,1
8,xs9pjy,mib_sum1ls,thelogicalghost,1,4
9,xs9pjy,glomgore,mib_sum1ls,1,1


was able to calculate the edge weights for 11100 user pairs

note: for 1818/18949 comments, the parent comment was not found.

In [46]:
data_method2c_final = pd.DataFrame(columns=['from_user','to_user','num','denom'])
print(data_method2c_final)

Empty DataFrame
Columns: [from_user, to_user, num, denom]
Index: []


Create a new dataframe to store the aggregated user pairs across all comments on reddit

In [48]:
data_method2c_final = data_method2c.groupby(['from_user', 'to_user'], as_index=False).agg({'num':'sum','denom':'sum'})

In [49]:
print(len(data_method2c_final))
print(data_method2c_final.columns)
print(data_method2c_final.head(140))

11100
Index(['from_user', 'to_user', 'num', 'denom'], dtype='object')
             from_user         to_user  num  denom
0            --tenet--   automoderator    1     89
1         -1kingkrool-        web-dude    1      3
2             -banned-  stimulatedecho    1      1
3           -cosmonaut  desi_launda101    1     33
4               -domi-       --tenet--    1      7
..                 ...             ...  ...    ...
135  _pm_me_pangolins_       coinnoob7    1     20
136  _pm_me_pangolins_        dmc_2930    1      2
137  _pm_me_pangolins_          duncle    1      1
138             _qoop_   onceisforever    1     14
139     _turkeyfucker_        forepony    1      2

[140 rows x 4 columns]


In [50]:
data_method2c_final['edgeweight_method2c'] = data_method2c_final['num']/data_method2c_final['denom']

In [51]:
print(len(data_method2c_final))
print(data_method2c_final.head(10))

11100
          from_user             to_user  num  denom  edgeweight_method2c
0         --tenet--       automoderator    1     89             0.011236
1      -1kingkrool-            web-dude    1      3             0.333333
2          -banned-      stimulatedecho    1      1             1.000000
3        -cosmonaut      desi_launda101    1     33             0.030303
4            -domi-           --tenet--    1      7             0.142857
5            -domi-           cptgrudge    1      2             0.500000
6          -hastis-         theuturn2yz    1     11             0.090909
7         -hominid-  inpurpleidescended    1      3             0.333333
8  -kibbles-n-tits-            cardiomg    1      1             1.000000
9      -metacelsus-             agood10    1      5             0.200000


In [52]:
import networkx as nx

In [53]:
G_method2c = nx.from_pandas_edgelist(data_method2c_final, "from_user", "to_user", edge_attr="edgeweight_method2c", create_using=nx.DiGraph()) #weight for graph not set
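
Many weighted graph routines look for an edge attribute literally named 'weight'; as far as I know, the `node2vec` package does the same through its `weight_key` argument, which defaults to 'weight'. A minimal sketch on a made-up edge list, assuming you want the method 2c edge weight to be picked up under that name:

```python
import networkx as nx
import pandas as pd

# Made-up edge list with the same columns as data_method2c_final.
edges = pd.DataFrame({
    "from_user": ["a", "b"],
    "to_user": ["b", "c"],
    "edgeweight_method2c": [0.5, 1.0],
})

# Rename the column so the edge attribute is stored under the key 'weight'.
G = nx.from_pandas_edgelist(
    edges.rename(columns={"edgeweight_method2c": "weight"}),
    "from_user", "to_user",
    edge_attr="weight",
    create_using=nx.DiGraph(),
)
print(G["a"]["b"]["weight"])
```

Keeping the attribute name 'edgeweight_method2c', as this notebook does, is also fine as long as every downstream call is told which key to read.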

In [54]:
pip install Node2Vec

Collecting Node2Vec
  Downloading node2vec-0.4.6-py3-none-any.whl (7.0 kB)
Collecting networkx<3.0,>=2.5 (from Node2Vec)
  Downloading networkx-2.8.8-py3-none-any.whl (2.0 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.0/2.0 MB 12.7 MB/s eta 0:00:00
Installing collected packages: networkx, Node2Vec
  Attempting uninstall: networkx
    Found existing installation: networkx 3.2.1
    Uninstalling networkx-3.2.1:
      Successfully uninstalled networkx-3.2.1
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
lida 0.0.10 requires fastapi, which is not installed.
lida 0.0.10 requires kaleido, which is not installed.
lida 0.0.10 requires python-multipart, which is not installed.
lida 0.0.10 requires uvicorn, which is not installed.
Successfully installed Node2Vec-0.4.6 networkx-2.8.8


In [55]:
from node2vec import Node2Vec as n2v

The code to visualize the graph takes a while to run on a graph of this size.

In [56]:
import matplotlib.pyplot as plt
import numpy as np

In [None]:
pos2 = nx.spring_layout(G_method2c, k=1)  # k sets the optimal distance between nodes
nx.draw(G_method2c, pos2, with_labels=True)
labels2 = {e: G_method2c.edges[e]['edgeweight_method2c'] for e in G_method2c.edges}
nx.draw_networkx_edge_labels(G_method2c, pos2, edge_labels=labels2)
plt.show()

KeyboardInterrupt: ignored
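
Drawing all nodes at once did not finish here. One possible workaround is to lay out only a small subgraph, for example the highest-degree nodes. The sketch below uses a random toy graph in place of `G_method2c`, and the cut-off `TOP_K` is a hypothetical choice:

```python
import networkx as nx

# Random toy directed graph standing in for G_method2c.
G = nx.gnp_random_graph(200, 0.02, seed=0, directed=True)

TOP_K = 20  # hypothetical cut-off: number of highest-degree nodes to keep

# Pick the TOP_K nodes with the largest total degree and take their subgraph.
top_nodes = sorted(G.nodes, key=lambda n: G.degree(n), reverse=True)[:TOP_K]
sub = G.subgraph(top_nodes)

# The layout is now computed for TOP_K nodes instead of the whole graph.
pos = nx.spring_layout(sub, seed=0)
print(sub.number_of_nodes(), sub.number_of_edges())
```

The resulting `sub` and `pos` can then be passed to `nx.draw(sub, pos, with_labels=True)`.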

In [57]:
print(G_method2c.number_of_nodes()) #no.of nodes
print(G_method2c.number_of_edges()) #edges same as number of rows
print(np.mean([d for _, d in G_method2c.degree()])) #average degree of nodes
print(G_method2c.size(weight='edgeweight_method2c'))

8119
11100
2.7343268875477276
6199.509119602103


In [58]:
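WINDOW = 3 # Node2Vec fit window
MIN_COUNT = 1 # Node2Vec min. count
BATCH_WORDS = 4 # Node2Vec batch words

# note: the edge attribute is named 'edgeweight_method2c', not 'weight', so as far as I can tell the walks treat all edges as equally weighted
g_emb2bc = n2v(G_method2c,dimensions=10)

# note: vector_size passed to fit() determines the embedding length (128, see the output below), not dimensions=10 above
mdl2bc = g_emb2bc.fit(
    vector_size = 128,
    window=WINDOW,
    min_count=MIN_COUNT,
    batch_words=BATCH_WORDS
)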

Computing transition probabilities:   0%|          | 0/8119 [00:00<?, ?it/s]

Generating walks (CPU: 1): 100%|██████████| 10/10 [00:02<00:00,  4.71it/s]


information about generated embeddings

In [59]:
print(len(mdl2bc.wv)) #print the length of the embeddings generated = 8119 which is the number of nodes
print(len(mdl2bc.wv[0])) #length of each embedding is 128
print(mdl2bc.wv[0]) #print the first embedding

8119
128
[-1.1390388  -0.58902365  0.57918966  0.29086542  0.21750066  0.43635342
 -0.01045428  0.33350882 -0.7905867   0.4016738   0.7084855   0.4559694
 -0.5995863   0.03402391  0.32834423  0.406008    0.05034577 -0.4273784
 -0.9273991   1.0838814   0.30316654  0.42142907  0.0140845  -1.4240818
 -0.47188163  0.555894   -0.23990038  1.2976651  -0.6634823   0.2941935
 -1.0653911  -0.34623984 -0.8324783  -0.64639026  0.81862205  0.25773138
  0.8429811  -1.1306467   0.23375015 -0.18343708 -0.5000748   0.3393889
 -0.48941714  0.08795022  0.9518539   0.80185467 -0.8419977   0.338032
 -0.40125296  0.11506318  0.74403864  0.75496095 -0.70748097  0.9623635
 -0.04601167 -1.2425113   0.8363594  -0.0098894  -0.09848891  0.37259364
 -1.2122347  -0.5961342   0.04115847 -0.9551357  -0.2052725  -0.2838916
 -0.2646359   0.9801216  -1.1787679   0.6300639   0.19489479 -0.8021638
 -0.90065396 -1.2991941   1.0529792  -0.2117117  -0.42104033 -0.4625939
 -0.22030947 -0.09285962 -0.39882562  0.22749476  0.7

In [60]:
mdl2bc.wv['omniskeptic']

array([ 0.06000087, -0.34073174,  0.13287199,  0.00433089,  0.07013424,
        0.02141614, -0.23767953,  0.02733303, -0.38480076,  0.2202409 ,
        0.26319203,  0.13504991, -0.16875824, -0.24829689,  0.01491444,
       -0.14691168,  0.0558856 ,  0.06889694, -0.5642429 ,  0.6506243 ,
       -0.07522672,  0.29741865, -0.01737268, -0.19979724, -0.2304017 ,
        0.13977899, -0.0144087 ,  0.2536338 , -0.56519866,  0.17135857,
       -0.1617568 , -0.08471223,  0.08403223, -0.11633076,  0.12891017,
        0.15555961,  0.24815433, -0.39869484,  0.14767234,  0.10096043,
       -0.04843248, -0.08951047, -0.27701727,  0.0942157 ,  0.40911123,
        0.13317758,  0.2556325 , -0.0565758 , -0.04022529, -0.33381703,
        0.4314237 ,  0.11968478, -0.38308766,  0.07960983, -0.3096217 ,
       -0.30606425,  0.11857907, -0.16624412, -0.1166245 , -0.1996661 ,
       -0.07877335, -0.06680552,  0.034343  ,  0.3321648 ,  0.19086133,
       -0.4315842 , -0.09504814, -0.01092439,  0.35448027,  0.24

Function to find cosine similarity

In [61]:
import numpy as np
def cosine_sim(vector1, vector2):
    return min(1., np.dot(vector1, vector2) / (np.linalg.norm(vector1, ord=2) * np.linalg.norm(vector2, ord=2)))
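
Written out, the function above computes the following, where $u$ and $v$ are the two embedding vectors; the cap at 1 presumably guards against floating-point rounding pushing the ratio slightly above 1:

$$
\text{cosine\_sim}(u, v) = \min\left(1,\ \frac{u \cdot v}{\lVert u \rVert_2 \, \lVert v \rVert_2}\right)
$$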

In [62]:
w = mdl2bc.wv['omniskeptic']
z = mdl2bc.wv['regular_modern_girl']
y = mdl2bc.wv['chop1n']
print(cosine_sim(w, z))
print(cosine_sim(z, y))
print(cosine_sim(y, w))

0.26857308
0.30504432
0.9856476


In [63]:
comment_id = 'omniskeptic'
for s in mdl2bc.wv.most_similar(comment_id, topn = 10):
    print(s)

('chop1n', 0.9856476187705994)
('babyyodasdirtydiaper', 0.9272844195365906)
('vt_squire', 0.9223006963729858)
('ohmysatanharderplz', 0.9157809019088745)
('dukuel', 0.9110397100448608)
('6threplacementmonkey', 0.7944790124893188)
('rasputin170', 0.7930231094360352)
('monkno5', 0.7899104952812195)
('8esix', 0.7879255414009094)
('marozsas', 0.7860144972801208)


store the generated embeddings

Create a new dataframe to store the cosine similarity between the sender node and the receiver node. Since method 2c aggregates over the entire data, the submission id and the subreddit id of a user pair are not available, so they are not included in the dataframe.

In [64]:
data_method2c_node2vec = pd.DataFrame(columns=['from_user','to_user','edgeweight_method2c','cosine_similarity'])
print(data_method2c_node2vec)

Empty DataFrame
Columns: [from_user, to_user, edgeweight_method2c, cosine_similarity]
Index: []


In [65]:
i = 0
print('generating cosine similarity for '+str(len(data_method2c_final))+' user pairs'+'\n')

for ind, row in data_method2c_final.iterrows():

  i += 1
  if i % 1000 == 0:
    print('finished '+ str(i)+'/'+str(len(data_method2c_final))+' user pairs')

  #get user pair and calculate the cosine similarity between their node embeddings
  from_user = row['from_user']
  to_user = row['to_user']
  from_user_embedding = mdl2bc.wv[from_user]
  from_to_embedding = mdl2bc.wv[to_user]

  cos_sim_between_from_to_user = cosine_sim(from_user_embedding, from_to_embedding)

  curr_edge_weight = row['edgeweight_method2c']

  data_method2c_node2vec.loc[len(data_method2c_node2vec.index)] = [from_user, to_user, curr_edge_weight, cos_sim_between_from_to_user]

generating cosine similarity for 11100 user pairs

finished 1000/11100 user pairs
finished 2000/11100 user pairs
finished 3000/11100 user pairs
finished 4000/11100 user pairs
finished 5000/11100 user pairs
finished 6000/11100 user pairs
finished 7000/11100 user pairs
finished 8000/11100 user pairs
finished 9000/11100 user pairs
finished 10000/11100 user pairs
finished 11000/11100 user pairs
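
The per-row loop above can also be expressed with NumPy arrays. This is a sketch under assumptions: the dictionary `emb` is a made-up stand-in for `mdl2bc.wv`, and `pairs` is a toy version of `data_method2c_final`.

```python
import numpy as np
import pandas as pd

# Made-up embeddings standing in for mdl2bc.wv (user -> vector).
emb = {
    "a": np.array([1.0, 0.0]),
    "b": np.array([0.0, 1.0]),
    "c": np.array([1.0, 1.0]),
}

# Toy version of data_method2c_final.
pairs = pd.DataFrame({
    "from_user": ["a", "a"],
    "to_user": ["b", "c"],
    "edgeweight_method2c": [0.5, 1.0],
})

# Stack the sender and receiver embeddings into two aligned matrices.
A = np.stack([emb[u] for u in pairs["from_user"]])
B = np.stack([emb[u] for u in pairs["to_user"]])

# Row-wise cosine similarity, capped at 1 like cosine_sim above.
cos = (A * B).sum(axis=1) / (np.linalg.norm(A, axis=1) * np.linalg.norm(B, axis=1))
pairs["cosine_similarity"] = np.minimum(1.0, cos)
print(pairs)
```

This also avoids growing a dataframe one row at a time with `.loc`, which becomes slow as the number of user pairs increases.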


In [66]:
print(len(data_method2c_node2vec))
print(data_method2c_node2vec.columns)
print(data_method2c_node2vec.head(3))

11100
Index(['from_user', 'to_user', 'edgeweight_method2c', 'cosine_similarity'], dtype='object')
      from_user         to_user  edgeweight_method2c  cosine_similarity
0     --tenet--   automoderator             0.011236           0.754318
1  -1kingkrool-        web-dude             0.333333           0.984803
2      -banned-  stimulatedecho             1.000000           0.978135


In [67]:
data_method2c_node2vec.to_csv('/content/gdrive/MyDrive/Colab Notebooks/askscience/data_askscience_subreddits_method2c_node2vec.csv',index=False)