
---
# Mapping Every Comment to a Network and Cultural Similarity Measure, and a Parent Comment

This Python notebook maps each comment to two similarity values between the sender of the comment and the author of its parent comment: a network similarity (the cosine similarity between the two users' node embeddings) and a cultural similarity. The similarity values for unique user pairs come from two files: 'data_askscience_subreddits_method2c_node2vec.csv', produced by the 'Node_Embeddings_For_askscience_Subreddits.ipynb' notebook (network similarity), and 'data_askscience_user_pair_level_culsim.csv', produced by the 'Word_Embeddings_For_askscience_Subreddits.ipynb' notebook (cultural similarity). The notebook also maps each comment to a parent-comment-author.<br>

Thus, each comment has been mapped to a network similarity, cultural similarity, and a parent comment author.

OUTPUT FILES:<br>
1. 'similarity_askscience_subreddits.csv' : which contains all the comments of the askscience subreddit, only some of which have a network similarity and cultural similarity measure
2. 'data_similarity_askscience_subreddits.csv' : which contains only the comments which have both a network similarity measure and a cultural similarity measure and a parent comment

---

askscience subreddit: additional details (pair and comment counts are taken from the run recorded below)<br>
the askscience unprocessed data has 26605 comments<br>
there are 13270 comments left after rows with either author as '[deleted]' or body as '[removed]' are removed<br>
there are 11100 user pairs for which there is a network similarity<br>
there are 11100 user pairs for which there is a cultural similarity<br>
thus there should be 11100 unique user pairs<br>
thus for 11782 comments there is a network similarity, a cultural similarity and a parent comment; these cover 11100 unique user pairs, so 682 comments repeat an existing user pair (5.8 percent repeats)


In [1]:
import torch
if torch.cuda.is_available():
    device_name = torch.device("cuda")
else:
    device_name = torch.device('cpu')
print("Using {}.".format(device_name))

Using cuda.


In [2]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


---
# Map each comment to a network similarity and a cultural similarity<br>
###### (network similarity = the cosine similarity between the node embeddings of the parent comment's author and the sender of this comment)<br>
###### (cultural similarity = the average cosine similarity between the word embeddings of the parent comment's author and the sender of this comment)

---
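As a reminder of what the similarity values represent, here is a minimal sketch of cosine similarity between two embedding vectors. The helper name `cosine_similarity` and the 3-dimensional vectors are made up for illustration; the actual embeddings are produced in the node-embedding and word-embedding notebooks and are not recomputed here.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    if denom == 0.0:
        return float("nan")  # undefined when either vector is all zeros
    return float(np.dot(u, v) / denom)

# Hypothetical embeddings of a sender and a parent-comment author
sender_embedding = [1.0, 2.0, 0.0]
parent_embedding = [2.0, 4.0, 0.0]  # same direction, different length
sim = cosine_similarity(sender_embedding, parent_embedding)
print(sim)
```

Vectors pointing in the same direction score 1 regardless of their length, and orthogonal vectors score 0.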

read the output file which has the network similarity for unique user pairs

In [3]:
import pandas as pd
import numpy as np
data_askscience_subreddits_method2c_node2vec = pd.read_csv('/content/gdrive/MyDrive/Colab Notebooks/askscience/data_askscience_subreddits_method2c_node2vec.csv', low_memory=False)
print(len(data_askscience_subreddits_method2c_node2vec))
print(len(pd.unique(data_askscience_subreddits_method2c_node2vec['from_user'])))
print(len(pd.unique(data_askscience_subreddits_method2c_node2vec['to_user'])))
print(len(data_askscience_subreddits_method2c_node2vec.columns))
print(data_askscience_subreddits_method2c_node2vec.head(3))

11100
6063
5068
4
      from_user         to_user  edgeweight_method2c  cosine_similarity
0     --tenet--   automoderator             0.011236           0.754318
1  -1kingkrool-        web-dude             0.333333           0.984803
2      -banned-  stimulatedecho             1.000000           0.978135


##### note: The column 'cosine_similarity' is actually the 'Network Similarity'

read the output file which has the cultural similarity for unique user pairs

In [4]:
data_askscience_user_pair_level_culsim = pd.read_csv('/content/gdrive/MyDrive/Colab Notebooks/askscience/data_askscience_user_pair_level_culsim.csv', low_memory=False, index_col=0)
print(len(data_askscience_user_pair_level_culsim))
print(len(pd.unique(data_askscience_user_pair_level_culsim['from_user'])))
print(len(pd.unique(data_askscience_user_pair_level_culsim['to_user'])))
print(len(data_askscience_user_pair_level_culsim.columns))
print(data_askscience_user_pair_level_culsim.head(3))

11100
6063
5068
3
      from_user         to_user  cultural_similarity
0     --tenet--   automoderator             0.118791
1  -1kingkrool-        web-dude             0.604204
2      -banned-  stimulatedecho             0.198554


There is no difference between the number of user pairs for which cultural similarity or network similarity was obtained. (If there were a difference, a likely cause would be that comments with a missing body, or with the body text '[deleted]', were dropped while calculating cultural similarity.)

Thus 11100 unique user pairs were found in the data of askscience subreddit.<br>
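Equal row counts do not by themselves show that both files contain the same pairs. One possible check, sketched here on toy data, is an outer merge with `indicator=True`. The frames `network_pairs` and `cultural_pairs` are hypothetical stand-ins for the two tables loaded above, and their values are made up.

```python
import pandas as pd

# Toy stand-ins for the two user-pair tables (values are made up)
network_pairs = pd.DataFrame({
    "from_user": ["a", "b", "c"],
    "to_user": ["x", "y", "z"],
    "cosine_similarity": [0.9, 0.8, 0.7],
})
cultural_pairs = pd.DataFrame({
    "from_user": ["a", "b", "d"],
    "to_user": ["x", "y", "z"],
    "cultural_similarity": [0.3, 0.4, 0.5],
})

# The "_merge" column says whether a pair is in both tables or only one
coverage = network_pairs.merge(
    cultural_pairs, on=["from_user", "to_user"], how="outer", indicator=True
)
coverage_counts = coverage["_merge"].value_counts()
print(coverage_counts)
```

If every pair falls under `both`, the two files cover exactly the same user pairs.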

Function to map each comment to a network similarity and a cultural similarity (if a valid parent comment is found)

In [5]:
def similarity_mapper(input_data):

  input_data['network_similarity'] = np.nan
  input_data['cultural_similarity'] = np.nan
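  #object dtype so that author names (strings) can be stored without a dtype conflict
  input_data['parent_comment_author'] = pd.Series(np.nan, index=input_data.index, dtype=object)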

  ignore_comments_counter = 0
  j = 0

  #additional code to resolve an error
  type_base = type(input_data['parent_id'].iloc[0])

  for ind, row in input_data.iterrows():
    j += 1
    if j % 10000 == 0:
      print('finished comment '+str(j)+'/'+str(len(input_data)))

    curr_author = row['author']
    #curr_subreddit_id = row['subreddit_id'] we dont need subreddit_id here
    curr_id = row['id']
    if type(row['parent_id']) != type_base: #e.g. a missing parent_id that was read as a float NaN
      ignore_comments_counter += 1
      continue
    #curr_parent_comment_id = row['parent_id'][3:] #noticed that the parent id is nothing but the comment id preceded by 3 characters
    curr_parent_comment_id = row['parent_id'] #parent_id is not preceded by those 3 characters as before

    #find the parent comment
    #for a primary comment, parent_id == link_id, so the parent is the submission itself
    parent_authors = input_data.loc[input_data['id'] == curr_parent_comment_id, 'author']
    if len(parent_authors) == 0: #the parent comment could not be found
      ignore_comments_counter += 1
      continue
    #if the parent was found
    curr_parent = parent_authors.values[0]

    #add the parent
    input_data.at[ind,'parent_comment_author'] = curr_parent

    #find the network similarity for this comment from the user pairs data
    net_match = data_askscience_subreddits_method2c_node2vec[(data_askscience_subreddits_method2c_node2vec['from_user'] == curr_author) & (data_askscience_subreddits_method2c_node2vec['to_user'] == curr_parent)]['cosine_similarity'].values
    if len(net_match) != 0:
      input_data.at[ind,'network_similarity'] = net_match[0]
    else:
      continue #could not find the network similarity

    #find the cultural similarity for this comment from the user pairs data
    cul_match = data_askscience_user_pair_level_culsim[(data_askscience_user_pair_level_culsim['from_user'] == curr_author) & (data_askscience_user_pair_level_culsim['to_user'] == curr_parent)]['cultural_similarity'].values
    if len(cul_match) != 0:
      input_data.at[ind,'cultural_similarity'] = cul_match[0]
    else:
      continue #could not find the cultural similarity


  print('total number of comments ignored: ' +str(ignore_comments_counter))
  return input_data
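The row-by-row loop above filters whole data frames once per comment, which becomes slow as the data grows. A possible alternative, sketched here as an assumption rather than as the method used in this notebook, is to express the same mapping as three left joins. The function name `similarity_mapper_merge` and the toy frames are hypothetical. Unlike the loop, it returns a new frame with a fresh index and does not count ignored comments.

```python
import pandas as pd

def similarity_mapper_merge(comments, network_pairs, cultural_pairs):
    """Attach parent author and both similarities with joins instead of a loop."""
    key = ["author", "parent_comment_author"]

    # id -> author lookup; keep the first row if an id appears more than once
    parents = (
        comments[["id", "author"]]
        .drop_duplicates(subset="id")
        .rename(columns={"id": "parent_id", "author": "parent_comment_author"})
    )
    out = comments.merge(parents, on="parent_id", how="left")

    # Rename the pair tables so their keys line up with the comment columns
    net = network_pairs.rename(columns={
        "from_user": "author",
        "to_user": "parent_comment_author",
        "cosine_similarity": "network_similarity",
    })[key + ["network_similarity"]].drop_duplicates(subset=key)
    cul = cultural_pairs.rename(columns={
        "from_user": "author",
        "to_user": "parent_comment_author",
    })[key + ["cultural_similarity"]].drop_duplicates(subset=key)

    out = out.merge(net, on=key, how="left")
    out = out.merge(cul, on=key, how="left")

    # Mirror the loop: no cultural similarity is kept without a network similarity
    out.loc[out["network_similarity"].isna(), "cultural_similarity"] = float("nan")
    return out

# Toy data (made up): c1 replies to a submission that is not in the table
comments = pd.DataFrame({
    "id": ["c1", "c2", "c3", "c4"],
    "author": ["alice", "bob", "carol", "alice"],
    "parent_id": ["s1", "c1", "c2", "c2"],
})
network_pairs = pd.DataFrame({
    "from_user": ["bob", "carol"],
    "to_user": ["alice", "bob"],
    "cosine_similarity": [0.9, 0.8],
})
cultural_pairs = pd.DataFrame({
    "from_user": ["bob"],
    "to_user": ["alice"],
    "cultural_similarity": [0.5],
})
mapped = similarity_mapper_merge(comments, network_pairs, cultural_pairs)
print(mapped)
```

The left joins keep every comment, so rows without a parent or without a matching user pair simply carry missing values, as in the loop version.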


read the askscience subreddit data and call the function to find the network similarity, cultural similarity, and the parent author for each comment.

**Run the next block only if you have not executed 'Submissions_Processing_askscience.ipynb'. If you have already executed it and have a processed file called 'data_merged_askscience.csv', skip to the block after it.** <br>

In [None]:
import pandas as pd
data_askscience = pd.read_csv('/content/gdrive/MyDrive/Colab Notebooks/askscience/data_askscience.csv', low_memory=False)
print(len(data_askscience)) #length of data = 26605
print(len(pd.unique(data_askscience['subreddit']))) #number of subreddits considered = 1
print(len(pd.unique(data_askscience['id']))) #unique number of comments = 26605 #the data is at the comment level
print(len(pd.unique(data_askscience['parent_id']))) #number of parent nodes = 10538
print(len(pd.unique(data_askscience['link_id']))) #number of submissions = 3004
print(len(pd.unique(data_askscience['author']))) #number of authors = 6629
print(len(data_askscience.columns)) # = 12

26605
1
26605
10538
3004
6629
12


Continue from here if you have already executed 'Submissions_Processing_askscience.ipynb' and have a processed file called 'data_merged_askscience.csv'.<br>

In [6]:
import pandas as pd
data_askscience = pd.read_csv('/content/gdrive/MyDrive/Colab Notebooks/askscience/data_merged_askscience.csv', low_memory=False)
print(len(data_askscience)) #length of data = 18949
print(len(pd.unique(data_askscience['subreddit']))) #number of subreddits considered = 1
print(len(pd.unique(data_askscience['id']))) #unique number of ids = 18949 #the data is at the comment/submission level
print(len(pd.unique(data_askscience['parent_id']))) #number of parent nodes = 7790
print(len(pd.unique(data_askscience['link_id']))) #number of submissions = 2867
print(len(pd.unique(data_askscience['author']))) #number of authors = 10895
print(len(data_askscience.columns)) #number of columns = 17

18949
1
18949
7790
2867
10895
17


In [7]:
similarity_askscience_subreddits = similarity_mapper(data_askscience)
print(len(similarity_askscience_subreddits)) #length of data = 18949
print(len(pd.unique(similarity_askscience_subreddits['subreddit']))) #number of subreddits considered = 1
print(len(pd.unique(similarity_askscience_subreddits['id']))) #unique number of ids = 18949, the data is at the comment/submission level
print(len(pd.unique(similarity_askscience_subreddits['parent_id']))) #number of parent nodes = 7790
print(len(pd.unique(similarity_askscience_subreddits['link_id']))) #number of submissions = 2867
print(len(similarity_askscience_subreddits.columns)) #number of columns = 20 (17 original + 3 added by similarity_mapper)

finished comment 10000/18949
total number of comments ignored: 6827
18949
1
18949
7790
2867
20


number of unique author-parent comment author values in resulting set

In [8]:
print(len(similarity_askscience_subreddits[['author', 'parent_comment_author']].value_counts())) #number of user-pairs.

11404


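The number of unique (author, parent comment author) pairs in the entire data is 11404. Rows without a parent comment author are not counted, because value_counts drops missing values by default.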

In [9]:
similarity_askscience_subreddits.head(3)

Unnamed: 0,id,subreddit,body,author,score,gilded,created_utc,retreived_on,permalink,num_comments,url,self_text,is_self,parent_id,link_id,controversiality,is_submitter,network_similarity,cultural_similarity,parent_comment_author
0,xsjqzy,askscience,Why do I poop after a glass or two of beer or ...,depressedchiq,1,0,1664591533,1665426496,/r/askscience/comments/xsjqzy/why_do_i_poop_af...,0,https://www.reddit.com/r/askscience/comments/x...,[removed],True,0,0,novalue,True,,,
1,xsju1p,askscience,If you heated water under immense pressure so ...,zhongliabuse,1,0,1664591793,1665426492,/r/askscience/comments/xsju1p/if_you_heated_wa...,1,https://www.reddit.com/r/askscience/comments/x...,[removed],True,0,0,novalue,True,,,
2,xsjx0l,askscience,What is the evolutionary goal of nose elongati...,_ozeki,1,0,1664592050,1665426489,/r/askscience/comments/xsjx0l/what_is_the_evol...,0,https://www.reddit.com/r/askscience/comments/x...,[removed],True,0,0,novalue,True,,,


In [10]:
print(similarity_askscience_subreddits['network_similarity'].isna().sum())
print(similarity_askscience_subreddits['cultural_similarity'].isna().sum())
print(similarity_askscience_subreddits['parent_comment_author'].isna().sum())

7167
7167
6827


save 'similarity_askscience_subreddits' to a csv file, this is all the comments in the data set, but only some of them have network similarity and cultural similarity measures

In [11]:
similarity_askscience_subreddits.to_csv('/content/gdrive/MyDrive/Colab Notebooks/askscience/similarity_askscience_subreddits.csv')

In [None]:
import pandas as pd
similarity_askscience_subreddits = pd.read_csv('/content/gdrive/MyDrive/Colab Notebooks/askscience/similarity_askscience_subreddits.csv', low_memory=False)
print(len(similarity_askscience_subreddits))

26605


In [12]:
data_similarity_askscience_subreddits = similarity_askscience_subreddits[~similarity_askscience_subreddits['network_similarity'].isna()]
print(len(data_similarity_askscience_subreddits))

11782


In [13]:
#build the mask from the frame being filtered, so pandas does not have to reindex it
data_similarity_askscience_subreddits = data_similarity_askscience_subreddits[~data_similarity_askscience_subreddits['cultural_similarity'].isna()]
print(len(data_similarity_askscience_subreddits))

11782




In [14]:
#build the mask from the frame being filtered, so pandas does not have to reindex it
data_similarity_askscience_subreddits = data_similarity_askscience_subreddits[~data_similarity_askscience_subreddits['parent_comment_author'].isna()]
print(len(data_similarity_askscience_subreddits))

11782




In [15]:
print(len(data_similarity_askscience_subreddits))
data_similarity_askscience_subreddits.head(3)

11782


Unnamed: 0,id,subreddit,body,author,score,gilded,created_utc,retreived_on,permalink,num_comments,url,self_text,is_self,parent_id,link_id,controversiality,is_submitter,network_similarity,cultural_similarity,parent_comment_author
5679,iqker6l,askscience,No it does not imply that. “We don’t yet know”...,omniskeptic,2,0,1664582942,1664960533,0,0,0,0,0,iqkee0k,xs73nx,0,False,0.985648,0.318494,chop1n
5681,iqkfl8j,askscience,Pasteurization works by heating (generally a l...,jeweledjuniper,11,0,1664583360,1664960508,0,0,0,0,0,iqke0xc,xs1k1y,0,False,0.957111,0.642043,feitingen
5682,iqkfmj9,askscience,"It *absolutely* implies an expectation, even i...",chop1n,3,0,1664583378,1664960507,0,0,0,0,0,iqker6l,xs73nx,0,False,0.985648,0.421561,omniskeptic


note: this means that 11782 comments have a cultural similarity, a network similarity and a parent comment
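The three filtering steps can also be written as a single `dropna` call over the required columns. The sketch below uses a made-up frame called `mapped` in place of the real data; only the three column names come from this notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the mapped comments (values are made up)
mapped = pd.DataFrame({
    "id": ["c1", "c2", "c3"],
    "network_similarity": [np.nan, 0.9, 0.8],
    "cultural_similarity": [np.nan, 0.5, np.nan],
    "parent_comment_author": [np.nan, "alice", "bob"],
})

# Keep only rows where all three required columns are present
required = ["network_similarity", "cultural_similarity", "parent_comment_author"]
complete = mapped.dropna(subset=required)
print(len(complete))
```

Because the mask is computed on the frame being filtered, this form does not trigger the reindexing warning that comes from masking one frame with a boolean Series built from another.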

now check how many unique values of user-parent pairs are there

In [16]:
print(len(data_similarity_askscience_subreddits[['author', 'parent_comment_author']].value_counts()))

11100


That means 11782 - 11100 = 682 comments repeat a user-parent pair that already appears in the data.
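The same subtraction can be cross-checked directly with `duplicated`, which flags every row whose (author, parent author) pair has already appeared. This is a small sketch on made-up data; the frame `pairs` is hypothetical.

```python
import pandas as pd

# Toy stand-in for the filtered comments (values are made up)
pairs = pd.DataFrame({
    "author": ["bob", "bob", "carol", "bob"],
    "parent_comment_author": ["alice", "alice", "bob", "alice"],
})
key = ["author", "parent_comment_author"]

n_comments = len(pairs)
n_unique_pairs = len(pairs.drop_duplicates(subset=key))
# duplicated() marks every occurrence of a pair after its first
n_repeats = int(pairs.duplicated(subset=key).sum())
print(n_comments, n_unique_pairs, n_repeats)
```

The number of repeats always equals the number of comments minus the number of unique pairs.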

In [17]:
print(len(data_similarity_askscience_subreddits)) #length of data = 11782
print(len(pd.unique(data_similarity_askscience_subreddits['subreddit']))) #number of subreddits considered = 1
print(len(pd.unique(data_similarity_askscience_subreddits['id']))) #unique number of comments = 11782, the data is at the comment level
print(len(pd.unique(data_similarity_askscience_subreddits['parent_id']))) #number of parent nodes = 6725
print(len(pd.unique(data_similarity_askscience_subreddits['link_id']))) #number of submissions = 2377
print(len(data_similarity_askscience_subreddits.groupby(['author', 'parent_comment_author']).size().reset_index(name='Freq'))) #unique number of user pairs = 11100, so 11782-11100 = 682 repeats
print(len(data_similarity_askscience_subreddits.columns)) #number of columns = 20
print(data_similarity_askscience_subreddits.head(5))

11782
1
11782
6725
2377
11100
20
           id   subreddit                                               body  \
5679  iqker6l  askscience  No it does not imply that. “We don’t yet know”...   
5681  iqkfl8j  askscience  Pasteurization works by heating (generally a l...   
5682  iqkfmj9  askscience  It *absolutely* implies an expectation, even i...   
5692  iqkpm44  askscience  Thank you for your submission! Unfortunately, ...   
5701  iqkrd5j  askscience  Thats also what I remember. There was speculat...   

                  author  score  gilded  created_utc  retreived_on permalink  \
5679         omniskeptic      2       0   1664582942    1664960533    000000   
5681      jeweledjuniper     11       0   1664583360    1664960508    000000   
5682              chop1n      3       0   1664583378    1664960507    000000   
5692  askscience-modteam      1       0   1664588425    1664960199    000000   
5701           greese007      2       0   1664589335    1664960145    000000   

     

Confirm that these columns do not have any missing values

In [18]:
print(data_similarity_askscience_subreddits['network_similarity'].isna().sum())
print(data_similarity_askscience_subreddits['cultural_similarity'].isna().sum())
print(data_similarity_askscience_subreddits['parent_comment_author'].isna().sum())
print(data_similarity_askscience_subreddits['body'].isna().sum())

0
0
0
0


In [19]:
data_similarity_askscience_subreddits.to_csv('/content/gdrive/MyDrive/Colab Notebooks/askscience/data_similarity_askscience_subreddits.csv')